I was unable to find a good documentation/explanation as to what severity indicates in nodetool gossipinfo. was looking for a detailed explanation but could not find a suitable one.
The severity is a value added to the latency in the dynamic snitch to determine which replica a coordinator will send the read's DATA and DIGEST requests to.
Its value would depend on the IO used in compaction and also it would try to read /proc/stat (same as the iostat utility) to get actual disk statistics as its weight. In post 3.10 versions of cassandra this is removed in https://issues.apache.org/jira/browse/CASSANDRA-11738. In pervious versions you can disable it by setting -Dcassandra.ignore_dynamic_snitch_severity in jvm options. The issue is that it weighting the io use the same as the latency. So if a node is GC thrashing and not doing much IO because of it, it could end up being treated as the target of most reads even though its the worst possible node to send requests to.
Now you can still use JMX to set the value still (to 1) if you want to exclude it from being used for reads. A example use case is using nodetool disablebinary so application wont query it directly, then setting the severity to 1. That node would then only be queried by cluster if theres a CL.ALL request or a read repair. Its a way to take a node "offline" for maintenance from a read perspective but still allow it to get mutations so it doesn't fall behind.
Severity reports activity that happens on the particular node (compaction, etc.), and this information then is used to make a decision on what node could better handle the request. There is discussion in original JIRA about this functionality & how this information is used.
P.S. Please see Chris's answer about changes in post 3.10 versions - I wasn't aware about these changes...
Related
We have a Cassandra 2.0.17 cluster with 3 DCs, where each DC has 8 nodes and RF of 3. We have not been running regular repairs on it.
One node has been down for 2 months due to hardware issue with one of the drives.
We finally got a new drive to replace the faulty one, and are trying to figure out the best way to bring the node back into the cluster.
We initially thought to just run nodetool repair but from my research so far it seems like that would only be good if the node was down for less than gc_grace_seconds which is 10 days.
Seems like that would mean removing the node and then adding it back in as a new node.
Someone mentioned somewhere that rather than completely removing the node and then bootstrapping it back in, I could potentially use the same procedure used for replacing a node, using the replace_address flag (or replace_address_first_boot if available), to replace the node with itself. But I couldn't find any real documentation or case studies of doing this.
It seems like this is not a typical situation - normally, either a node goes down for a short period of time and you can just run repair on it, or it needs to be replaced altogether. But it's hard to find much prior art on our exact use case.
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster?
Is repair really not a good option here?
Also, whatever the answer is, how would I monitor the process and ensure that it's successful?
So here's what I would do:
If you haven't already, run a removenode on the "dead" node's host ID.
Fire-up the old node, making sure that it is not a seed node and that auto_bootstrap is either true or not specified. It defaults to true unless explicitly set otherwise.
It should join right back in, and re-stream its data.
You can monitor it's progress by running nodetool netstats | grep Already, which returns a status by each node streaming, specifying completion progress in terms of # of files streamed vs. total files.
The advantage of doing it this way, is that the node will not attempt to serve requests until bootstrapping is completed.
If you run into trouble, feel free to comment here or ask for help in the cassandra-admins channel on DataStax's Discord server.
You have mentioned already that you are aware that node has to be removed if it is down for more than gc_grace_seconds
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster? Is repair really not a good option here?
So the answer is that only. You cannot safely bring that node back if it is down more than gc_grace_seconds. It needs to be removed to prevent possible deleted data from appearing back.
https://stackoverflow.com/a/69098765/429476
From https://community.datastax.com/questions/3987/one-of-my-nodes-powered-off.html
Erick Ramirez answered • May 12 2020 at 1:19 PM | Erick Ramirez edited • Dec 03 2021 at 4:49 AM BEST ANSWERACCEPTED ANSWER
#cache_drive If the node has been down for less than the smallest gc_grace_seconds, it should be as simple as starting Cassandra on the node then running a repair on it.
If the node has been down longer than the smallest GC grace, you will need to wipe the node clean including deleting all the contents of data/, commitlog/ and saved_caches/. Then replace the node "with itself" by adding the replace_address flag and specifying its own IP. For details, see Replacing a dead node. Cheers!
How to find out when a Cassandra table becomes "eventually consistent"? Is there a definitive way to determine this at a given point in time? Preferably programatically through the Datastax driver API? I checked out the responses to the following related questions but there does not seem to be anything more concrete than "check the nodetool netstats output"
Methods to Verify Cassandra Node Sync
how do i know if nodetool repair is finished
If your system is always online doing operations then it may never become full consistent at single point of time untill you are on Consistency level "ALL".
Repairs process logs error in log file if it does not get reply from other replica nodes cause they were down/timeout etc.
you can check the logs if no error WRT AntiEntropy/stream it means your system is almost consistence.
We have a 20 nodes Cassandra cluster running a lot of read requests (~900k/sec at peak). Our dataset is fairly small, so everything is served directly from memory (OS Page Cache). Our datamodel is quite simple (just a key/value) and all of our reads are performed with consistency level one (RF 3).
We use the Java Datastax driver with TokenAwarePolicy, so all of our reads should go directly to one node that has the requested data.
These are some metrics extracted from one of the nodes regarding client read request latency and local read latency.
org_apache_cassandra_metrics_ClientRequest_50thPercentile{scope="Read",name="Latency",} 105.778
org_apache_cassandra_metrics_ClientRequest_95thPercentile{scope="Read",name="Latency",} 1131.752
org_apache_cassandra_metrics_ClientRequest_99thPercentile{scope="Read",name="Latency",} 3379.391
org_apache_cassandra_metrics_ClientRequest_999thPercentile{scope="Read",name="Latency",} 25109.16
org_apache_cassandra_metrics_Keyspace_50thPercentile{keyspace=“<keyspace>”,name="ReadLatency",} 61.214
org_apache_cassandra_metrics_Keyspace_95thPercentile{keyspace="<keyspace>",name="ReadLatency",} 126.934
org_apache_cassandra_metrics_Keyspace_99thPercentile{keyspace="<keyspace>",name="ReadLatency",} 182.785
org_apache_cassandra_metrics_Keyspace_999thPercentile{keyspace="<keyspace>",name="ReadLatency",} 454.826
org_apache_cassandra_metrics_Table_50thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 105.778
org_apache_cassandra_metrics_Table_95thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 1131.752
org_apache_cassandra_metrics_Table_99thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 3379.391
org_apache_cassandra_metrics_Table_999thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 25109.16
Another important detail is that most of our queries (~70%) don't return anything, i.e., they are for records not found. So, bloom filters play an important role here and they seem to be fine:
Bloom filter false positives: 27574
Bloom filter false ratio: 0.00000
Bloom filter space used:
Bloom filter off heap memory used: 6760992
As it can be seen, the reads in each one of the nodes are really fast, the 99.9% is less than 0.5 ms. However, the client request latency is way higher, going above 4ms on the 99%. If I'm reading with CL ONE and using TokenAwarePolicy, shouldn't both values be similar to each other, since no coordination is required? Am I missing something? Is there anything else I could check to see what's going on?
Thanks in advance.
#luciano
there are various reasons why the coordinator and the replica can report different 99th percentiles for read latencies, even with token awareness configured in the client.
these can be anything that manifests in between the coordinator code to the replica's storage engine code in the read path.
examples can be:
read repairs (not directly related to a particular request, as is asynchronous to the read the triggered it, but can cause issues),
host timeouts (and/or speculative retries),
token awareness failure (dynamic snitch simply not keeping up),
GC pauses,
look for metrics anomalies per host, overlaps with GC, and even try to capture traces for some of the slower requests and investigate if they're doing everything you expect from C* (eg token awareness).
well-tuned and spec'd clusters may also witness the dynamic snitch simply not being able to keep up and do its intended job. in such situations disabling the dynamic snitch can fix the high latencies for top-end read percentiles. see https://issues.apache.org/jira/browse/CASSANDRA-6908
be careful though, measure and confirm hypotheses, as mis-applied solutions easily have negative effects!
Even if using TokenAwarePolicy, the driver can't work with the policy when the driver doesn't know which partition key is.
If you are using simple statements, no routing information is provided. So you need additional information to the driver by calling setRoutingKey.
The DataStax Java Driver's manual is a good friend.
http://docs.datastax.com/en/developer/java-driver/3.1/manual/load_balancing/#requirements
If TokenAware is perfectly working, CoordinatorReadLatency value is mostly same value with ReadLatency. You should check it too.
http://cassandra.apache.org/doc/latest/operating/metrics.html?highlight=coordinatorreadlatency
thanks for your reply and sorry about the delay in getting back to you.
One thing I’ve found out is that our clusters had:
dynamic_snitch_badness_threshold=0
in the config files. Changing that to the default value (0.1) helped a lot in terms of the client request latency.
The GC seems to be stable, even under high load. The pauses are constant (~10ms / sec) and I haven’t seen spikes (not even full gcs). We’re using CMS with a bigger Xmn (2.5GB).
Read repairs happen all the time (we have it set to 10% chance), so when the system is handling 800k rec/sec, we have ~80k read repairs/sec happening in background.
It also seems that we’re asking too much for the 20 machines cluster. From the client point of view, latency is quite stable until 800k qps, after that it starts to spike a little bit, but still under a manageable threshold.
Thanks for all the tips, the dynamic snitch thing was really helpful!
Sorry if this is an existing question, but any of the existing ones resolved my problem..
I've installed Cassandra single noded. I don't have a large application right now, but I think this can be the case soon, and I will need more and more nodes..
Well, I'm saving data from a stream to Cassandra, and this were going well, but suddently, when I tried to read data, I've started to receive this error:
"Not enough replica available for query at consistency ONE (1 required but only 0 alive)"
My keyspace was built using simplestrategy with replication_factor = 1. Im saving data separated by a field called "catchId", so most of my queries are like: "select * from data where catchId='xxx'". catchId is a partition key.
I'm using the cassandra-driver-core version 3.0.0-rc1.
The thing is that I don't have that much of data rigth now, and I'm thinking if it will be better to use a RDBMS for now, and migrate to Cassandra only when I have a better infrastructure.
Thanks :)
It seems that your node is unable to respond when you try to make your read (in general this error appears for more than one node). If you do not have lots of data, it's very strange, so this is probably a bad design choice. This can emanate from several things, so you have to make a few investigations.
study your logs ! In particular the system.log
you can change your read_request_timeout_in_ms parameter in cassandra.yaml. Although it's not agood idea in production, it will say you if it's just temporary problem (your request succeed after a little time) or a bigger problem
study your CPU and memory behavior when you are doing requests
if you are very motivated, you can install opscenter which will you give more valuable informations
How and how many write requests are you doing ? It can overwhelm cassandra (even if it's designed for). I recommend to make async requests to avoid problems.
I encountered a consistency problem using Hector and Cassandra when we have Quorum for both read and write.
I use MultigetSubSliceQuery to query rows from super column limit size 100, and then read it, then delete it. And start another around.
I found that the row which should be deleted by my prior query is still shown from next query.
And also from a normal Column Family, I updated the value of one column from status='FALSE' to status='TRUE', and the next time I queried it, the status was still 'FALSE'.
More detail:
It has not happened not every time (1/10,000)
The time between the two queries is around 500 ms (but we found one pair of queries in which 2 seconds had elapsed between them, still indicating a consistency problem)
We use ntp as our cluster time synchronization solution.
We have 6 nodes, and replication factor is 3
I understand that Cassandra is supposed to be "eventually consistent", and that read may not happen before write inside Cassandra. But for two seconds?! And if so, isn't it then meaningless to have Quorum or other consistency level configurations?
So first of all, is it the correct behavior of Cassandra, and if not, what data we need to analyze for further investment?
After check the source code with the system log, I found the root cause of the inconsistency.
Three factors cause the problem:
Create and update same record from different nodes
Local system time is not synchronized accurately enough (although we use NTP)
Consistency level is QUORUM
Here is the problem, take following as the event sequence
seqID NodeA NodeB NodeC
1. New(.050) New(.050) New(.050)
2. Delete(.030) Delete(.030)
First Create request come from Node C with local time stamp 00:00:00.050, assume requests first record in Node A and Node B, then later synchronized with Node C.
Then Delete request come from Node A with local time stamp 00:00:00.030, and record in node A and Node B.
When read request come, Cassandra will do version conflict merge, but the merge only depend on time stamp, so although Delete happened after Create, but the merge final result is "New" which has latest time stamp due to local time synchronization issue.
I also faced similar a issue. The issue occured because cassandra driver uses server timestamp by default to check which query is latest. However in latest version of cassandra driver they have changes it and now by default they are using client timestamp.
I have described the details of issue here
The deleted rows may be showing up as "range ghosts" because of the way that distributed deletes work: see http://wiki.apache.org/cassandra/FAQ#range_ghosts
If you are reading and writing individual columns both at CL_QUORUM, then you should always get full consistency, regardless of the time interval (provided strict ordering is still observed, i.e. you are certain that the read is always after the write). If you are not seeing this, then something, somewhere, is wrong.
To start with, I'd suggest a) verifying that the clients are syncing to NTP properly, and/or reproduce the problem with times cross-checked between clients somehow, and b) maybe try to reproduce the problem with CL_ALL.
Another thought - are your clients synced with NTP, or just the Cassandra server nodes? Remember that Cassandra uses the client timestamps to determine which value is the most recent.
I'm running into this problem with one of my clients/node. The other 2 clients I'm testing with (and 2 other nodes) run smoothly. I have a test that uses QUORUM in all reads and all writes and it fails very quickly. Actually some processes do not see anything from the others and others may always see data even after I QUORUM remove it.
In my case I turned on the logs and intended to test the feat with the tail -F command:
tail -F /var/lib/cassandra/log/system.log
to see whether I was getting some errors as presented here. To my surprise the tail process itself returned an error:
tail: inotify cannot be used, reverting to polling: Too many open files
and from another thread this means that some processes will fail opening files. In other words, the Cassandra node is likely not responding as expected because it cannot properly access data on disk.
I'm not too sure whether this is related to the problem that the user who posted the question, but tail -F is certainly a good way to determine whether the limit of files was reached.
(FYI, I have 5 relatively heavy servers running on the same machine so I'm not too surprise about the fact. I'll have to look into increasing the ulimit. I'll report here again if I get it fixed in this way.)
More info about the file limit and the ulimit command line option: https://askubuntu.com/questions/181215/too-many-open-files-how-to-find-the-culprit
--------- Update 1
Just in case, I first tested using Java 1.7.0-11 from Oracle (as mentioned below, I first used a limit of 3,000 without success!) The same error would popup at about the same time when running my Cassandra test (Plus even with the ulimit of 3,000 the tail -F error would still appear...)
--------- Update 2
Okay! That worked. I changed the ulimit to 32,768 and the problems are gone. Note that I had to enlarge the per user limit in /etc/security/limits.conf and run sudo sysctl -p before I could bump the maximum to such a high number. Somehow the default upper limit of 3000 was not enough even though the old limit was only 1024.