Data in Cassandra not consistent even with Quorum configuration

I encountered a consistency problem using Hector and Cassandra with QUORUM for both reads and writes.
I use MultigetSubSliceQuery to query rows from a super column with a limit of 100, then read them, then delete them, and then start another round.
I found that a row which should have been deleted by the prior query was still returned by the next query.
Likewise, in a normal column family, I updated the value of one column from status='FALSE' to status='TRUE', and the next time I queried it the status was still 'FALSE'.
More detail:
It does not happen every time (roughly 1 in 10,000 operations)
The time between the two queries is around 500 ms (but we found one pair of queries in which 2 seconds had elapsed between them, still indicating a consistency problem)
We use ntp as our cluster time synchronization solution.
We have 6 nodes, and replication factor is 3
I understand that Cassandra is supposed to be "eventually consistent", and that read may not happen before write inside Cassandra. But for two seconds?! And if so, isn't it then meaningless to have Quorum or other consistency level configurations?
So first of all, is this the correct behavior for Cassandra, and if not, what data do we need to analyze for further investigation?

After checking the source code against the system log, I found the root cause of the inconsistency.
Three factors combine to cause the problem:
Creating and updating the same record from different nodes
Local system time not being synchronized accurately enough (even though we use NTP)
Consistency level QUORUM
Here is the problem; take the following as the event sequence:
seqID NodeA NodeB NodeC
1. New(.050) New(.050) New(.050)
2. Delete(.030) Delete(.030)
The Create request comes from Node C with local timestamp 00:00:00.050; assume it is first recorded on Node A and Node B, then later synchronized with Node C.
Then the Delete request comes from Node A with local timestamp 00:00:00.030, and is recorded on Node A and Node B.
When a read request comes in, Cassandra performs a version-conflict merge, but the merge depends only on the timestamp. So although the Delete happened after the Create, the final merged result is "New", which carries the later timestamp because of the local clock synchronization issue.
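The merge described above is a last-write-wins reconciliation. A minimal Python simulation (the names `Cell` and `merge` are illustrative, not Cassandra's internals):

```python
from collections import namedtuple

# A cell is a value plus its client-supplied timestamp; a tombstone marks a delete.
# These names are illustrative, not Cassandra's actual internals.
Cell = namedtuple("Cell", ["value", "timestamp", "tombstone"])

def merge(cells):
    """Last-write-wins: the version with the highest timestamp survives."""
    return max(cells, key=lambda c: c.timestamp)

# Create tagged 00:00:00.050 (Node C's clock), Delete tagged 00:00:00.030 (Node A's clock).
create = Cell(value="New", timestamp=50, tombstone=False)
delete = Cell(value=None, timestamp=30, tombstone=True)

winner = merge([create, delete])
print(winner.value)  # "New": the delete loses even though it happened later
```

Because reconciliation looks only at timestamps, never at arrival order, QUORUM cannot fix this: the replicas all agree, and they all agree on the wrong answer.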

I also faced a similar issue. It occurred because the Cassandra driver uses the server timestamp by default to decide which write is latest. However, in recent versions of the driver this has changed, and the client timestamp is now used by default.
I have described the details of the issue here.

The deleted rows may be showing up as "range ghosts" because of the way that distributed deletes work: see http://wiki.apache.org/cassandra/FAQ#range_ghosts
If you are reading and writing individual columns both at CL_QUORUM, then you should always get full consistency, regardless of the time interval (provided strict ordering is still observed, i.e. you are certain that the read is always after the write). If you are not seeing this, then something, somewhere, is wrong.
To start with, I'd suggest a) verifying that the clients are syncing to NTP properly, and/or reproducing the problem with times cross-checked between clients somehow, and b) perhaps trying to reproduce the problem with CL_ALL.
Another thought - are your clients synced with NTP, or just the Cassandra server nodes? Remember that Cassandra uses the client timestamps to determine which value is the most recent.
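The guarantee mentioned above follows from simple counting: when R + W > RF, every read quorum intersects every write quorum. A quick sketch:

```python
def quorum(rf):
    # Cassandra's QUORUM is floor(rf / 2) + 1 replicas.
    return rf // 2 + 1

rf = 3
w = quorum(rf)            # 2 replicas must acknowledge the write
r = quorum(rf)            # 2 replicas must answer the read
overlap = r + w - rf      # replicas guaranteed to be in both sets
assert overlap >= 1       # at least one replica in the read set saw the write
print(w, r, overlap)      # 2 2 1
```

Note the overlap only guarantees the newest write is *seen* by the read; which version wins is still decided by timestamp, which is why clock skew can defeat QUORUM.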

I'm running into this problem with one of my clients/nodes. The other 2 clients I'm testing with (and the 2 other nodes) run smoothly. I have a test that uses QUORUM for all reads and all writes, and it fails very quickly. Some processes never see anything written by the others, and others still see data even after I remove it at QUORUM.
In my case I turned on the logs and watched them with the tail -F command:
tail -F /var/lib/cassandra/log/system.log
to see whether I was getting some errors as presented here. To my surprise the tail process itself returned an error:
tail: inotify cannot be used, reverting to polling: Too many open files
and according to another thread this means that some processes will fail to open files. In other words, the Cassandra node is likely not responding as expected because it cannot properly access its data on disk.
I'm not sure whether this is related to the problem of the user who posted the question, but tail -F is certainly a good way to determine whether the limit on open files has been reached.
(FYI, I have 5 relatively heavy servers running on the same machine, so I'm not too surprised by this. I'll have to look into increasing the ulimit. I'll report here again if that fixes it.)
More info about the file limit and the ulimit command line option: https://askubuntu.com/questions/181215/too-many-open-files-how-to-find-the-culprit
--------- Update 1
Just in case, I first tested using Java 1.7.0-11 from Oracle (and, as mentioned below, I first used a limit of 3,000 without success!). The same error would pop up at about the same time when running my Cassandra test (and even with a ulimit of 3,000 the tail -F error would still appear...)
--------- Update 2
Okay! That worked. I changed the ulimit to 32,768 and the problems are gone. Note that I had to enlarge the per user limit in /etc/security/limits.conf and run sudo sysctl -p before I could bump the maximum to such a high number. Somehow the default upper limit of 3000 was not enough even though the old limit was only 1024.
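To check whether a process is near the descriptor limit, you can inspect RLIMIT_NOFILE programmatically; a small sketch using Python's standard resource module (Unix-only):

```python
import resource

# Inspect this process's open-file limits (the same limits ulimit -n reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-files limit: soft={soft}, hard={hard}")

# A process may raise its own soft limit up to the hard limit without root;
# raising the hard limit itself needs /etc/security/limits.conf, as above.
try:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
except (ValueError, OSError):
    pass  # not permitted in some restricted environments
```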


Cassandra repairs on TWCS

We have a 13-node Cassandra cluster (version 3.10) with RF 2 and read/write consistency of ONE.
This means that the cluster isn't fully consistent, but eventually consistent. We chose this setup to speed up the performance, and we can tolerate a few seconds of inconsistency.
The tables are set with TWCS with read repair disabled, and we don't run full repairs on them.
However, we've discovered that some entries of the data are replicated only once, and not twice, which means that when the not-updated node is queried it fails to retrieve the data.
My first question is how could this happen? Shouldn't Cassandra replicate all the data?
Now if we choose to perform repairs, it will create overlapping tombstones, therefore they won't be deleted when their time is up. I'm aware of the unchecked_tombstone_compaction property to ignore the overlap, but I feel like it's a bad approach. Any ideas?
So you've obviously made some deliberate choices regarding your client CL. You've opted to potentially sacrifice consistency for speed. You have achieved your goal, but you assumed that data would always make it to all of the replicas it belongs on. There is no guarantee of that, as you have found out. How could that happen? There are multiple possible reasons, including: network issues, hardware overload (I/O, CPU, etc., which can cause dropped mutations), or Cassandra/DSE being unavailable for whatever reason.
If none of your nodes have been "offline" for at least a few hours (whether DSE or the host being unavailable), I'm guessing your nodes are dropping mutations, and I would check two things:
1) nodetool tpstats
2) Look through your cassandra logs
For DSE: cat /var/log/cassandra/system.log | grep -i mutation | grep -i drop (and debug.log as well)
I'm guessing you're probably dropping mutations, and the cassandra logs and tpstats will record this (tpstats will only show you since last cassandra/dse restart). If you are dropping mutations, you'll have to try to understand why - typically some sort of load pressure causing it.
I have vmstat output scheduled every second, spooling continuously to a log with rotation, so I can go back and check a few things if our nodes start misbehaving. It could help.
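When checking many nodes, the trailing "Message type / Dropped" table of nodetool tpstats can be parsed automatically; a hypothetical sketch (the sample text mimics the tpstats layout, which may vary by version):

```python
def dropped_messages(tpstats_output):
    """Parse the trailing 'Message type   Dropped' table of nodetool tpstats."""
    drops = {}
    in_table = False
    for line in tpstats_output.splitlines():
        if line.strip().startswith("Message type"):
            in_table = True       # the dropped-message table starts here
            continue
        if in_table and line.strip():
            msg_type, count = line.split()
            drops[msg_type] = int(count)
    return drops

# Sample output shape; real tpstats lists more message types.
sample = """\
Message type           Dropped
READ                         0
MUTATION                   142
"""
print(dropped_messages(sample))  # {'READ': 0, 'MUTATION': 142}
```

A non-zero MUTATION count here is the "dropped mutations" signal described above.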
That's where I would start. Either way, your decision to use read/write CL=1 has put you in this spot. You may want to reconsider that approach.
Consistency level ONE can sometimes cause problems, for many reasons: data may fail to replicate across the cluster because of dropped mutations, cluster/node overload, high CPU, high I/O, or network problems. In such cases you can suffer data inconsistency, although read repair sometimes handles it if enabled. You can run a manual repair to ensure consistency of the cluster, but in your case you may get some zombie data too.
I think that to avoid this kind of issue you should consider a CL of at least QUORUM for writes, or you should run a manual repair within gc_grace_seconds (default 10 days) for all the tables in the cluster.
You can also use incremental repair so that Cassandra runs repair in the background for chunks of data. For more details you can refer to the links below:
http://cassandra.apache.org/doc/latest/operating/repair.html or https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsRepair.html

Cassandra gossipinfo severity explained

I was unable to find good documentation or an explanation of what severity indicates in nodetool gossipinfo. I was looking for a detailed explanation but could not find a suitable one.
The severity is a value added to the latency in the dynamic snitch to determine which replica a coordinator will send the read's DATA and DIGEST requests to.
Its value depends on the I/O used by compaction, and it also tries to read /proc/stat (like the iostat utility) to get actual disk statistics for its weight. In post-3.10 versions of Cassandra this was removed in https://issues.apache.org/jira/browse/CASSANDRA-11738. In previous versions you can disable it by setting -Dcassandra.ignore_dynamic_snitch_severity in the JVM options. The issue is that it weights the I/O use the same as the latency, so if a node is GC thrashing and not doing much I/O because of it, it could end up being the target of most reads even though it's the worst possible node to send requests to.
You can still use JMX to set the value (to 1) if you want to exclude a node from being used for reads. An example use case: run nodetool disablebinary so applications won't query the node directly, then set the severity to 1. That node will then only be queried by the cluster for a CL.ALL request or a read repair. It's a way to take a node "offline" for maintenance from a read perspective while still allowing it to receive mutations so it doesn't fall behind.
Severity reports activity happening on the particular node (compaction, etc.), and this information is then used to decide which node can better handle the request. There is discussion in the original JIRA ticket about this functionality and how the information is used.
P.S. Please see Chris's answer about the changes in post-3.10 versions; I wasn't aware of those changes...

Could not read commit log descriptor in file

I started using Cassandra 3.7 and I constantly have problems with the commitlog. When the PC shuts down unexpectedly (because of a power outage, for example), the Cassandra service doesn't restart. I try to start it from the command line, but the error "could not read commit log descriptor in file" always appears.
I have to delete all the commit logs to get the Cassandra service to start, and the problem is that I lose a lot of data. I tried incrementing the replication factor to 3, but the result is the same.
What can I do to decrease the amount of lost data?
PS: I have only one PC to run the Cassandra database on; it is not possible to add more.
I think your only option here is to work around the issue, since it's unlikely there is a guaranteed way to prevent commit log files from getting corrupted on a sudden power outage. Since you only have a single node, it is more difficult to recover the data. Increasing the replication factor to 3 on a single-node cluster is not going to help.
One thing you can try is to flush the memtables more frequently. When a memtable is flushed, its entries in the commit log are discarded, which reduces the amount of data that exists only in the commit log and can be lost. Details here. This will, however, not resolve the root issue.
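As a sketch of that idea, the per-table memtable_flush_period_in_ms option forces periodic flushes, shrinking the window of writes that live only in the commitlog (the keyspace/table names are illustrative; the right period is a durability vs. I/O trade-off):

```sql
-- Flush this table's memtable at least once a minute, so at most ~60s of
-- its writes exist only in the commitlog at any given time.
ALTER TABLE my_keyspace.my_table WITH memtable_flush_period_in_ms = 60000;
```

The commitlog_sync settings in cassandra.yaml (batch vs. periodic) control how durably the commitlog itself is written and are the other half of this trade-off.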

Cassandra Deletes Not Consistently Working

I'm running Cassandra 2.2.1 on a 3-node cluster at RF=3. If I perform simple deletes at QUORUM on a bunch of entries, verifying the results via a select at QUORUM reveals that some entries that should have been deleted persist in the table. The delete queries, which were issued through the Java driver, completed successfully without exception. I also use a retry policy to handle failed deletes/writes, but the policy is never invoked for these failures because they 'succeed'. I can reproduce the problem 100% of the time; it usually starts happening after I've issued around 100 deletes into the table. I understand how tombstones and gc grace period work, and this is not a case of resurrected deletes. I read somewhere that it could be an NTP issue, but all 3 nodes sync to the same clock and there's no drift as far as I can tell. I can share logs or anything else required to root-cause this. Thanks!
Update:
I resolved the problem, and it seems to be a weird race condition that appears to be either time related or sequence related. If there is some clock drift between nodes, it is possible for the delete to be ignored if it was tagged with an earlier timestamp than the insert.
E.G.
-insert is issued by node 1 at T1 (timestamp of node 1)
-delete comes into the system via node 3 but tagged with timestamp T0
-system concludes that insert occurred later so ignores delete
This gives the illusion that the delete executes ahead of insert depending on the timestamp sent by the respective nodes.
Allowing sufficient time between insert and delete resolved my issue although I'm not quite sure what the real root cause was.
Another option is to enable client side timestamps (instead of server side which is what you currently have).
If the same client issues the insert/update/delete, this ensures the timestamps will be in line with the order of the operation invocations.
Using client-side timestamps removes the need to allow "sufficient time" between an insert/update and a delete.
Please note that a correct timestamp is also needed when two consecutive writes update the same "key" (and those bugs are harder to detect :( ). Client-side timestamps resolve such issues as well (given that the same client issues the requests).
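The client-side fix can be sketched as a monotonic timestamp generator, similar in spirit to the Java driver's AtomicMonotonicTimestampGenerator (this Python version is illustrative, not the driver's code):

```python
import time

class MonotonicTimestampGenerator:
    """Issue strictly increasing microsecond timestamps from one client.

    If two operations fall in the same microsecond, the second is bumped
    by 1 us, so a delete issued after an insert can never tie with or
    trail it in last-write-wins terms.
    """
    def __init__(self):
        self._last = 0

    def next(self):
        now = int(time.time() * 1_000_000)
        self._last = max(now, self._last + 1)
        return self._last

gen = MonotonicTimestampGenerator()
t_insert = gen.next()
t_delete = gen.next()
assert t_delete > t_insert  # the delete always wins the timestamp merge
```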
How much time do you have between the delete and the select? As Cassandra has an "eventually consistent" behaviour, adding a delay between the delete and the select may solve the issue

Replication acknowledgement in PostgreSQL + BDR

I'm using the libpq C library for testing a PG + BDR replica set. I'd like to get acknowledgement of the replication of CRUD operations. My purpose is to make my own log of the replication time in milliseconds or, if possible, microseconds.
The program:
Starts 10-20 threads with separate connections; each thread runs 1000-5000 cycles of basic CRUD operations on three tables.
Which would be the best way?
Should I parse some high-verbosity logs, if they contain suitable data with timestamps, or should my C API start N threads (N = {number of nodes} - {the master I'm connected to}) after every CRUD operation and query the nodes for the data?
You can't get replay confirmation of individual xacts easily. The system keeps track of the log sequence number replayed by peer nodes but not what transaction IDs those correspond to, since it doesn't care.
What you seem to want is near-synchronous or semi-synchronous replication. There's some work coming there for 9.6 that will hopefully benefit BDR in time, but that's well in the future.
In the mean time you can see the log sequence number as restart_lsn in pg_replication_slots. This is not the position the replica has replayed to, but it's the oldest point it might have to restart replay at after a crash.
You can see the other LSN fields like replay_location only when a replica is connected in pg_stat_replication. Unfortunately in 9.4 there's no easy way to see which slot in pg_replication_slots is associated with which active connection in pg_stat_replication (fixed in 9.5, but BDR is based on 9.4 still). So you have to use the application_name set by BDR if you want to pick out individual nodes, and it's ... "interesting" to parse. Also often truncated.
You can get the current LSN of the server you committed an xact on after committing it by calling SELECT pg_current_xlog_location(); which will return a value like 0/19E0F060 or whatever. You can then look that value up in the pg_stat_replication of peer nodes until you see that the replay_location for the node you committed on has reached or passed the LSN you captured immediately after commit.
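Comparing the captured LSN against replay_location means parsing the XX/YYYYYYYY text form into a number; a minimal sketch (the function names are illustrative):

```python
def parse_lsn(lsn_text):
    """Convert a PostgreSQL LSN like '0/19E0F060' to a 64-bit integer."""
    hi, lo = lsn_text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replayed_past(replay_location, committed_lsn):
    """True once a peer's replay_location has reached the commit point."""
    return parse_lsn(replay_location) >= parse_lsn(committed_lsn)

committed = "0/19E0F060"  # from SELECT pg_current_xlog_location(); after commit
print(replayed_past("0/19E0F100", committed))  # True: peer replayed past it
print(replayed_past("0/19E0F000", committed))  # False: peer still behind
```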
It's not perfect. There could be other work done between when you commit and when you capture the server's current LSN. There's no way around that, but at worst you wait slightly too long. If you're using BDR you shouldn't be caring about micro or even milliseconds anyway, since it's an asynchronous replication solution.
The principles are pretty similar to measuring replication lag for normal physical standby servers, so I suggest reading some docs on that. Except that pg_last_xact_replay_timestamp() won't work for logical replication, so you can't get lag using that, you have to use the LSNs and do your own timing client-side.
