Side effects of Cassandra hinted handoff lead to inconsistency

I have a 3-node cluster with a replication factor of 3. The consistency level is QUORUM for both writes and reads.
The traffic has three major steps:
Create:
Rowkey: xxxx
Column: status=new, requests="xxxxx"
Update:
Rowkey: xxxx
Column: status=executing, requests="xxxxx"
Delete:
Rowkey: xxxx
When one node is down, the cluster keeps working as the consistency configuration allows, and the final state is that all requests are finished and deleted.
If I list the result with the Cassandra client (also at consistency QUORUM), it shows an empty row (only the row key is left), which is correct.
But when we restart the dead node, hinted handoff writes the missed data back to it, so it receives a long stream of creates, updates, and deletes.
I don't know whether it is due to GC or compaction, but the delete records on the other two nodes seem to stop working: if I list the data with the Cassandra client (again at QUORUM), the deleted rows show up again with column values, because the recovered node is replaying the history.
If I check the data several times, I can see it changing as hinted handoff replays the operations: the deleted data shows up and then disappears.
Is there a way to keep this process invisible from outside until hinted handoff has finished?
What I want is for only the final, synchronized state to be visible. The intermediate states are out of date and incorrect, and should never be seen externally.
Is it due to row deletes instead of column deletes? Or compaction?

After checking the logs and configuration, I found it was caused by two things.
GC grace seconds
I use the Hector client to connect to Cassandra, and the default GC grace seconds it sets for each column family is zero! So by the time hinted handoff replays the temporary values, the tombstones on the other two nodes have already been removed by compaction, and the client reads the temporary value.
Secondary index
Even after fixing the first problem, I could still get temporary results from the Cassandra client. When I queried the data with a command like "get my_cf where column_one='value'", the temporary value showed up again. But when I queried the same record by row key, it was gone.
Our application always fetches data by row key, and that way I never saw the temporary value.
So it seems the secondary index does not honor the consistency configuration.
After I changed GC grace seconds to 10 days, our problem was solved, but the behavior of index queries is still strange.
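The first cause can be reproduced with a toy model. The sketch below is not Cassandra code: `Cell`, `Replica`, and `quorum_read_during_replay` are hypothetical names, each replica keeps only the newest cell per row, and the timestamps are made-up integers. It only illustrates why a purged tombstone lets a half-replayed hint win a read.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Cell:
    timestamp: int         # write time assigned by the coordinator (toy integer)
    value: Optional[str]   # None marks a tombstone, i.e. a delete


class Replica:
    """Toy replica that keeps only the newest cell per row key."""

    def __init__(self) -> None:
        self.rows: Dict[str, Cell] = {}

    def apply(self, key: str, cell: Cell) -> None:
        current = self.rows.get(key)
        if current is None or cell.timestamp > current.timestamp:
            self.rows[key] = cell

    def compact(self, now: int, gc_grace: int) -> None:
        # Purge tombstones older than gc_grace; the replica then forgets the delete.
        expired = [k for k, c in self.rows.items()
                   if c.value is None and now - c.timestamp >= gc_grace]
        for key in expired:
            del self.rows[key]


def quorum_read_during_replay(gc_grace: int) -> Optional[str]:
    healthy, recovering = Replica(), Replica()
    # The healthy replica saw create (t=1), update (t=2) and delete (t=3).
    for cell in (Cell(1, "status=new"), Cell(2, "status=executing"), Cell(3, None)):
        healthy.apply("xxxx", cell)
    healthy.compact(now=4, gc_grace=gc_grace)
    # The recovering replica has replayed only the first hint so far.
    recovering.apply("xxxx", Cell(1, "status=new"))
    # The read merges both answers; the newest cell wins.
    cells = [r.rows["xxxx"] for r in (healthy, recovering) if "xxxx" in r.rows]
    return max(cells, key=lambda c: c.timestamp).value if cells else None
```

With `gc_grace=0` the read returns `"status=new"`, because nothing newer is left to shadow the replayed write. With `gc_grace=864000` (10 days) the tombstone survives compaction and the read returns `None`.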

Related

In Cassandra, why does a single value take precedence over quorum nodes empty response?

In Cassandra, tombstones are used for deletion since writes go to immutable files. I read that tombstones also solve the tough problem of deleting in distributed systems. This is where I am confused. What problems exist in deleting from distributed databases? For example, take a 3-node cluster with nodes A, B and C. Say node C is down and a delete comes in. It is marked with a tombstone on A and B and success is returned to the client. After some time compaction kicks in on A and B and clears out this tombstone. Now when a read comes for the previously deleted value, A and B return nothing while C returns the old value. But here I read that the value given by C takes precedence over the empty responses.
If the tombstoned record has already been deleted from the rest of the cluster before that node recovers, Cassandra treats the record on the recovered node as new data, and propagates it to the rest of the cluster.
Why does it do this? Since quorum nodes say the value is not present, why don't we return that back to the client? This could potentially simplify the problem of deletes in distributed systems as we needn't wait for gc grace seconds before clearing out the tombstones.
A quorum returning nothing could also mean that those nodes simply never received the value because they were down at the time of the write. In that case the single node holding the data is correct, and its value is propagated to the nodes that don't have it. Cassandra simply cannot tell whether data is missing because it was deleted and the tombstone has since been purged, or because the nodes were unavailable when it was written.
That's why it's important to run repairs regularly and to make sure each repair completes within gc_grace_seconds, and why you should not bring a machine back into the cluster after it has been offline for longer than that period.
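The ambiguity can be made concrete with a small sketch. It is an illustrative model, not Cassandra's read path: `reconcile` and the `Response` type are assumptions, and the timestamps are arbitrary integers.

```python
from typing import List, Optional, Tuple

# A replica response is None (the replica has nothing for the key),
# or (timestamp, value); a value of None stands for a tombstone.
Response = Optional[Tuple[int, Optional[str]]]


def reconcile(responses: List[Response]) -> Optional[str]:
    """Return the value carried by the newest non-empty response."""
    present = [r for r in responses if r is not None]
    if not present:
        return None
    return max(present, key=lambda r: r[0])[1]


# History 1: the delete reached A and B, their tombstones were purged, C missed it.
after_purged_delete: List[Response] = [None, None, (10, "old")]
# History 2: a write reached only C because A and B were unavailable.
after_partial_write: List[Response] = [None, None, (10, "old")]
# While the tombstones still exist on A and B, they outrank C's older value.
before_purge: List[Response] = [(20, None), (20, None), (10, "old")]
```

The two histories produce identical responses, so any rule that returned "no data" for the first would also discard the legitimate write in the second. Only `before_purge`, where the tombstones still exist, lets the delete win.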

Will Cassandra reach eventual consistency without manual repair if there is no read for that data during gc.grace.seconds?

Assume the following
Replication factor is 3
A delete was issued with consistency 2
One of the replicas was busy (not down) so it dropped the request
The other two replicas add the tombstone and send the response. So currently the record is marked for deletion on only two replicas.
No read repair happened, as there was no read for that data during gc.grace.seconds
Q1.
Will this data be resurrected when a read happens for that record after gc.grace.seconds if there was no manual repair?
(I am not talking about replica being down for more than gc.grace.seconds)
"One of the replica was busy (not down) so it drops the request"
In this case, the coordinator node realizes that the replica could not be written and stores it as a hint. Once the overwhelmed node starts taking requests again, the hint is "replayed" to get the replica consistent.
However, hints are only kept (by default) for 3 hours. After that time, they are dropped. So, if the busy node does not recover within that 3 hour window, then it will not be made consistent. And yes, in that case a query at consistency-ONE could allow that data to "ghost" its way back.
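As a rough decision aid, the two time windows discussed here can be put side by side. This is a simplified sketch that follows the description above, not Cassandra's internal logic: `recovery_path` and its return labels are hypothetical, the 3-hour figure is the default hint window cited in the answer, and 10 days is the usual default for gc_grace_seconds.

```python
HINT_WINDOW_SECONDS = 3 * 60 * 60       # 3 hours, the default cited above
GC_GRACE_SECONDS = 10 * 24 * 60 * 60    # 10 days, assumed default gc_grace_seconds


def recovery_path(unavailable_seconds: float,
                  hint_window: float = HINT_WINDOW_SECONDS,
                  gc_grace: float = GC_GRACE_SECONDS) -> str:
    """Classify how a replica that missed a delete can still be made consistent."""
    if unavailable_seconds <= hint_window:
        return "hints"               # the stored hint is replayed on recovery
    if unavailable_seconds < gc_grace:
        return "repair"              # tombstones still exist; a repair can copy them over
    return "resurrection-risk"       # tombstones may already be compacted away


print(recovery_path(10 * 60))            # hints
print(recovery_path(5 * 60 * 60))        # repair
print(recovery_path(11 * 24 * 60 * 60))  # resurrection-risk
```

The middle case is the one the question asks about: without a repair (or a lucky read repair) inside gc_grace_seconds, it silently turns into the third.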

Write consistency = ALL: do I still need weekly repairs to avoid zombie records?

As far as I understood, the problem of deleted data reappearing in Cassandra is as follows:
A delete is issued with consistency < ALL (e.g. QUORUM)
The delete succeeds, but some nodes in the replication set were not reachable during the delete
A tombstone is written to all the reached nodes, nothing in the others
10 days pass (gc_grace_seconds), and the tombstones become eligible for removal
Compactions happen, tombstones are actually removed
A read is issued: the nodes which received the delete reply with "no data"; the nodes which were unavailable during the delete reply with the old data; a zombie is produced
Now my question is: if the original delete was issued with consistency = ALL, all the nodes would either have the tombstone (before expiry&compaction) or no data at all (after expiry&compaction). No zombies should then be produced, even if we did not issue a repair before tombstone expiry.
Is this correct?
Yes, you still need to run repairs even with CL.ALL on the delete if you want to guarantee no resurrected data. You just decrease the likelihood of it occurring without you noticing.
If a node is unavailable for the delete, the delete will fail for the client (because of CL.ALL), but the other nodes will all still have received the delete. Even if your app retries the delete, there's a chance of the retry failing (e.g. your app's server is hit by a meteor). So then you have a delete that has been seen by 2 of your 3 replicas. If you lowered your gc_grace and don't run repairs, the other anti-entropy measures (hints, read repairs) may not ensure the tombstone was seen by the 3rd node before it is compacted away (they are best effort, not a guarantee). The next read touches the 3rd node, which has the original data, and no tombstone exists to say it was deleted, so you resurrect the data as it is read-repaired to the other replicas.
What you can do is log a statement somewhere whenever there is a CL.ALL timeout or failure. This is not a guarantee, since your app can die before the log is written, and a failure does not actually mean that the write did not get to all replicas, just that it may have failed to. That said, I would strongly recommend just using QUORUM (or LOCAL_QUORUM). That way you can tolerate some host failures without losing availability, since you need the repairs for the guarantee anyway.
When issuing queries with consistency=ALL, every node that owns the token range of that particular record has to acknowledge. So if one of those nodes is down during this process, the DELETE will fail, as it can't achieve the required consistency=ALL.
So consistency=ALL can leave you in a situation where every replica has to stay up, otherwise queries fail. That's why people recommend a weaker consistency level such as QUORUM. In other words, querying at consistency=ALL trades high availability for less dependence on repairs.
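The availability trade-off is simple arithmetic, sketched below. The helper names are illustrative and only three consistency levels are modeled.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge a request at the given level."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[level]


def tolerated_failures(level: str, replication_factor: int) -> int:
    """How many replicas may be down while requests at this level still succeed."""
    return replication_factor - required_acks(level, replication_factor)


def reads_see_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when every read set overlaps every write set (R + W > RF)."""
    return (required_acks(read_level, replication_factor)
            + required_acks(write_level, replication_factor)) > replication_factor


print(tolerated_failures("ALL", 3))                      # 0
print(tolerated_failures("QUORUM", 3))                   # 1
print(reads_see_latest_write("QUORUM", "QUORUM", 3))     # True
```

With a replication factor of 3, QUORUM reads and writes already overlap on at least one replica while surviving one node failure; ALL survives none.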

Partition DELETE/INSERT concurrency issue in Cassandra

I have a table in Cassandra which stores versions of csv-files. It uses a primary key with a unique id for the version (the partition key) and a row number (the clustering key). When I insert a new version I first execute a delete statement on the partition key I am about to insert, to clean up any incomplete data. Then the data is inserted.
Now here is the issue. Even though the delete and the subsequent insert are executed synchronously, one after the other, in the application, it seems that some level of concurrency still exists in Cassandra, because when I read afterwards, rows from my insert are occasionally missing, something like 1 in 3 times. Here are some facts:
Cassandra 3.0
Consistency ALL (R+W)
Delete using the Java Driver
Insert using the Spark-Cassandra connector
Number of nodes: 2
Replication factor: 2
The delete statement I execute looks like this:
"DELETE FROM myTable WHERE version = 'id'"
If I omit it, the problem goes away. If I insert a delay between the delete and the insert, the problem is reduced (fewer rows missing). Initially I used a less restrictive consistency level, and I was sure this was the issue, but changing it didn't affect the problem. My hypothesis is that for some reason the delete statement is being sent to the replica asynchronously despite the consistency level of ALL, but I can't see why this would be the case or how to avoid it.
By default, every mutation gets its write timestamp from the coordinator for that write. From the docs:
TIMESTAMP: sets the timestamp for the operation. If not specified,
the coordinator will use the current time (in microseconds) at the
start of statement execution as the timestamp. This is usually a
suitable default.
http://cassandra.apache.org/doc/cql3/CQL.html
Since the coordinator can differ from one mutation to the next, clock skew between coordinators can cause mutations coordinated by one machine to be timestamped out of order relative to those coordinated by another.
Since write time controls C* history, this means you can have a driver which synchronously inserts and deletes, but depending on the coordinator, the delete can happen "before" the insert.
Example
Imagine two nodes A and B, B is operating with a 5 second clock skew behind A.
At time 0: You insert data to the cluster and A is chosen as the coordinator. The mutation arrives at A and A assigns a timestamp (0)
There is now a record in the cluster
INSERT VALUE AT TIME 0
Both nodes contain this message and the request returns confirming the write was successful.
At time 2: You issue a delete for the data previously inserted and B is chosen as the coordinator. B assigns a timestamp of (-3) because it is clock skewed 5 seconds behind the time in A. This means that we end up with a statement like
DELETE VALUE AT TIME -3
We acknowledge that all nodes have received this record.
Now the global consistent timeline is
DELETE VALUE AT TIME -3
INSERT VALUE AT TIME 0
Since the insertion occurs after the delete the value still exists.
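The same last-write-wins reasoning can be checked with a few lines of code. This is a toy model with hypothetical names (`visible_value`, `Event`); times are whole seconds rather than the microseconds Cassandra uses. It replays the example above and then the order from the question (delete first, insert second).

```python
from typing import Dict, List, Optional, Tuple

# (true_time_seconds, coordinator, operation, value)
Event = Tuple[int, str, str, Optional[str]]


def visible_value(events: List[Event], skew: Dict[str, int]) -> Optional[str]:
    """Stamp each mutation with its coordinator's clock, then let the newest stamp win."""
    stamped = []
    for true_time, coordinator, operation, value in events:
        timestamp = true_time + skew[coordinator]
        stamped.append((timestamp, value if operation == "INSERT" else None))
    if not stamped:
        return None
    return max(stamped, key=lambda m: m[0])[1]


skewed = {"A": 0, "B": -5}      # B's clock runs 5 seconds behind A's
in_sync = {"A": 0, "B": 0}

# The example above: insert via A at time 0, delete via B at time 2.
insert_then_delete: List[Event] = [(0, "A", "INSERT", "v"), (2, "B", "DELETE", None)]
# The question's order: delete via A at time 0, insert via B at time 2.
delete_then_insert: List[Event] = [(0, "A", "DELETE", None), (2, "B", "INSERT", "row")]

print(visible_value(insert_then_delete, skewed))    # v    (the delete is stamped -3 and loses)
print(visible_value(delete_then_insert, skewed))    # None (the insert is stamped -3 and loses)
print(visible_value(delete_then_insert, in_sync))   # row
```

The second call is the symptom from the question: the insert is acknowledged by every replica, yet it is shadowed by a delete that carries a later timestamp.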
I had a similar problem, and I fixed it by enabling lightweight transactions (LWT) for both INSERT and DELETE requests (for all queries actually, including UPDATE). This makes sure all queries to the partition are serialized through one "thread", so a DELETE won't overwrite an INSERT. For example (assuming instance_id is the primary key):
INSERT INTO myTable (instance_id, instance_version, data) VALUES ('myinstance', 0, 'some-data') IF NOT EXISTS;
UPDATE myTable SET instance_version=1, data='some-updated-data' WHERE instance_id='myinstance' IF instance_version=0;
UPDATE myTable SET instance_version=2, data='again-some-updated-data' WHERE instance_id='myinstance' IF instance_version=1;
DELETE FROM myTable WHERE instance_id='myinstance' IF instance_version=2;
-- or:
DELETE FROM myTable WHERE instance_id='myinstance' IF EXISTS;
IF clauses enable lightweight transactions for each row, so all of them are serialized. Warning: LWTs are more expensive than normal calls, but sometimes they are needed, as in the case of this concurrency problem.
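The effect of the IF conditions can be pictured as compare-and-set on a version column. The class below is a single-process stand-in with hypothetical names; it models only the accept/reject semantics of the statements above, not the Paxos rounds Cassandra runs to agree on them across replicas.

```python
from typing import Dict, Tuple


class ConditionalTable:
    """In-memory stand-in for a table whose writes all carry an IF condition."""

    def __init__(self) -> None:
        # instance_id -> (instance_version, data)
        self.rows: Dict[str, Tuple[int, str]] = {}

    def insert_if_not_exists(self, key: str, version: int, data: str) -> bool:
        if key in self.rows:
            return False                      # condition failed, nothing written
        self.rows[key] = (version, data)
        return True

    def update_if_version(self, key: str, expected: int,
                          new_version: int, data: str) -> bool:
        row = self.rows.get(key)
        if row is None or row[0] != expected:
            return False                      # stale or out-of-order update is rejected
        self.rows[key] = (new_version, data)
        return True

    def delete_if_version(self, key: str, expected: int) -> bool:
        row = self.rows.get(key)
        if row is None or row[0] != expected:
            return False                      # delete cannot jump ahead of pending updates
        del self.rows[key]
        return True


table = ConditionalTable()
table.insert_if_not_exists("myinstance", 0, "some-data")
early_delete = table.delete_if_version("myinstance", 2)   # rejected: version is still 0
```

Because each operation is accepted or rejected against the current state instead of being ordered by a coordinator timestamp, an operation that arrives out of order fails visibly and can be retried.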

Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds. (Queue)

We have a table in Cassandra that is set up like this:
Primary key columns
shard - an integer between 1 and 1000
last_used - a timestamp
Value columns:
value - a 22 character string
Example of how this table is used:
shard last_used | value
------------------------------------
457 5/16/2012 4:56pm NBJO3poisdjdsa4djmka8k >-- Remove from front...
600 6/17/2013 5:58pm dndiapas09eidjs9dkakah |
...(1 million more rows) |
457 NOW NBJO3poisdjdsa4djmka8k <-- ..and put in back
The table is used as a giant queue. Very many threads are trying to "pop" the row off with the lowest last_used value, then update the last_used value to the current moment in time. This means that once a row is read, since last_used is part of the primary key, that row is deleted, then a new row with the same shard, value, and updated last_used time is added to the table, at the "end of the queue".
The shard is there because so many processes are trying to pop the oldest row off the front of the queue and put it at the back, that they would severely bottleneck each other if only one could access the queue at the same time. The rows are randomly separated into 1000 different "shards". Each time a thread "pops" a row off the beginning of the queue, it selects a shard that no other thread is currently using (using redis).
Holy crap, we must be dumb!
The problem we are having is that this operation has become very slow, on the order of 30 seconds, a virtual eternity.
We have been using Cassandra for less than a month, so we are not sure what we are doing wrong here. We have gotten some indication that perhaps we should not be writing to and reading from the same table so much. Is it the case that we should not be doing this in Cassandra? Or is there perhaps some nuance in the way we are doing it or the way that we have it configured that we need to change or adjust? How might we troubleshoot this?
More Info
We are using the Murmur3Partitioner (the new random partitioner)
The cluster is currently running on 9 servers with 2GB RAM each.
The replication factor is 3
Thanks so much!
This is something you should not use Cassandra for. The reason you're having performance issues is that Cassandra has to scan through mountains of tombstones to find the remaining live columns. Every time you delete something, Cassandra writes a tombstone, a marker that the column has been deleted. Nothing is actually deleted from disk until there is a compaction. When compacting, Cassandra looks at the tombstones and determines which columns are dead and which are still live, and the dead ones are thrown away (but then there is also GC grace, which means that in order to avoid spurious resurrections of columns Cassandra keeps the tombstones around for a while longer).
Since you're constantly adding and removing columns, there will be an enormous number of tombstones, and they will be spread across many SSTables. This means that there is a lot of overhead work Cassandra has to do to piece together a row.
Read the blog post "Cassandra anti-patterns: queues and queue-like datasets" for some more details. It also shows you how to trace the queries to verify the issue yourself.
It's not entirely clear from your description what a better solution would be, but it very much sounds like a message queue such as RabbitMQ, or possibly Kafka, would be a much better fit. They are made for constant churn and FIFO semantics; Cassandra is not.
There is a way to make the queries a bit less heavy for Cassandra, which you can try (although I still would say Cassandra is the wrong tool for this job): if you can include a timestamp in the query you should hit mostly live columns. E.g. add last_used > ? (where ? is a timestamp) to the query. This requires you to have a rough idea of the first timestamp (and don't do a query to find it out, that would be just as costly), so it might not work for you, but it would take some of the load off of Cassandra.
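To see why the lower bound helps, the sketch below models one shard as a list sorted by last_used in which deleted rows remain as tombstones. It is an illustration only: `pop_oldest` and the `Cell` tuple are hypothetical, and real reads also have to merge several SSTables.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (last_used, is_live); is_live == False marks a tombstone left by a delete.
Cell = Tuple[int, bool]


def pop_oldest(cells: List[Cell],
               after: Optional[int] = None) -> Tuple[Optional[int], int]:
    """Return (last_used of the first live cell, number of cells scanned)."""
    start = 0
    if after is not None:
        # Equivalent of "last_used > after": seek past every cell up to the bound.
        start = bisect_right(cells, (after, True))
    scanned = 0
    for last_used, is_live in cells[start:]:
        scanned += 1
        if is_live:
            return last_used, scanned
    return None, scanned


# 1000 popped (tombstoned) rows at the front of the queue, 10 live rows behind them.
shard = [(t, False) for t in range(1000)] + [(t, True) for t in range(1000, 1010)]

print(pop_oldest(shard))             # (1000, 1001): every tombstone is scanned first
print(pop_oldest(shard, after=990))  # (1000, 10): the bound skips most of them
```

Both calls find the same row, but the unbounded one pays for every tombstone ahead of it, and that cost grows with each pop until compaction purges the tombstones.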
The system appears to be under stress (2GB of RAM may not be enough).
Please run nodetool tpstats and report back the results.
Use RabbitMQ. Cassandra is probably a bad choice for this application.

Resources