Trying to set up a Cassandra cluster, but it tells me that I have a SchemaDisagreementException. What is weird is that writes sometimes work, just not all of the time. I assume that I must have different schemas somewhere, but I've cleared my data directories several times before, so it must not be in there. Where else would my schema be declared, other than in my code?
You've inspired a FAQ entry! :)
http://wiki.apache.org/cassandra/FAQ#schema_disagreement
(This didn't exist until 10 minutes ago, so no, you didn't miss it.)
The FAQ didn't help me resolve this issue; instead it confused me further. I have documented all my concerns and sources of confusion at http://nsinfra.blogspot.in/2013/06/cassandra-schema-disagreement-problem.html
Anyway, I got rid of this problem by synchronizing the clocks of the cluster nodes.
I'm seeing this too, about 1 in 3 times.
I'm able to reproduce it on a clean cluster -- no operations performed apart from starting the nodes -- and listing schema versions confirms that some nodes already have different schemas.
I get the error even with just 2 nodes, with the seeds on both nodes set to the two nodes' IP addresses. It occurs whether I start the nodes concurrently or with a 2-minute delay before starting the second node.
This is Cassandra version 1.2.2. Times on the machines are in sync (<1s difference).
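A quick way to confirm whether the nodes actually disagree is to list the schema version each node reports; this is just a sketch using standard nodetool commands.

nodetool describecluster
# The "Schema versions" section should list exactly one version UUID;
# more than one means the nodes disagree on schema.

nodetool gossipinfo | grep -E '^/|SCHEMA'
# gossipinfo shows the SCHEMA field per endpoint, handy for spotting
# which specific node is out of sync.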
Related
I have a cluster of about 100 nodes and it keeps growing. I need to add 10-50 nodes on request. As far as I know, Cassandra has cassandra.consistent.rangemovement=true by default, which means multiple nodes can't bootstrap at the same time.
Anyway, when I add many nodes using Terraform and a mostly default configuration (managed with Puppet), at least 2-3 of them end up in the UJ state and eventually only one bootstraps successfully. Earlier I used a random delay before starting cassandra.service, but that doesn't work when adding 10+ nodes.
I'm trying to figure out how to implement some kind of "lock" for bootstrapping.
I have Consul and could take a bootstrap lock in its KV store, for instance acquiring it via the systemd ExecStartPre feature, but I can't figure out how to release it after the bootstrap finishes.
I'm looking for any solution to this.
I've done something similar using Rundeck before. Basically, we had Rundeck kick off a bash script, taking parameters describing the node deployment as well as how many nodes to add.
What we did was parse the output of nodetool status. We'd count the total number of nodes as well as the number of UN (Up/Normal) indicators. If those two numbers didn't match, we'd sleep for 30 seconds and try again.
Once those numbers matched, we knew that it was safe to add another node. The total operation could take a while to add all nodes, but it worked.
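For illustration, a minimal sketch of that wait loop might look like the following (the status-code parsing and the 30-second interval are assumptions, not the exact script we ran):

#!/usr/bin/env bash
# Wait until every node reported by nodetool status is UN (Up/Normal)
# before adding the next node.
while true; do
  status=$(nodetool status)
  total=$(echo "$status" | grep -c -E '^[UD][NLJM] ')   # all node lines
  up_normal=$(echo "$status" | grep -c '^UN ')           # Up/Normal only
  if [ "$total" -eq "$up_normal" ]; then
    echo "all $total nodes are UN; safe to add the next node"
    break
  fi
  echo "$up_normal of $total nodes are UN; sleeping 30s"
  sleep 30
done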
We have a Cassandra 2.0.17 cluster with 3 DCs, where each DC has 8 nodes and RF of 3. We have not been running regular repairs on it.
One node has been down for 2 months due to a hardware issue with one of the drives.
We finally got a new drive to replace the faulty one, and are trying to figure out the best way to bring the node back into the cluster.
We initially thought to just run nodetool repair, but from my research so far it seems like that would only be appropriate if the node had been down for less than gc_grace_seconds, which is 10 days.
It seems like that would mean removing the node and then adding it back in as a new node.
Someone mentioned somewhere that rather than completely removing the node and then bootstrapping it back in, I could potentially use the same procedure used for replacing a node, using the replace_address flag (or replace_address_first_boot if available), to replace the node with itself. But I couldn't find any real documentation or case studies of doing this.
It seems like this is not a typical situation - normally, either a node goes down for a short period of time and you can just run repair on it, or it needs to be replaced altogether. But it's hard to find much prior art on our exact use case.
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster?
Is repair really not a good option here?
Also, whatever the answer is, how would I monitor the process and ensure that it's successful?
So here's what I would do:
If you haven't already, run a removenode on the "dead" node's host ID.
Fire up the old node, making sure that it is not a seed node and that auto_bootstrap is either true or not specified (it defaults to true unless explicitly set otherwise).
It should join right back in, and re-stream its data.
You can monitor its progress by running nodetool netstats | grep Already, which returns a status for each node streaming data, showing completion progress in terms of files streamed vs. total files.
The advantage of doing it this way is that the node will not attempt to serve requests until bootstrapping is complete.
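Roughly, the sequence of commands would look like this (the host ID is a placeholder, and the service commands depend on how Cassandra is installed):

# 1. Remove the dead node's old identity (host ID comes from "nodetool status").
nodetool removenode <host-id-of-dead-node>

# 2. On the replaced node: make sure it is not in its own seeds list and that
#    auto_bootstrap is true (or simply absent) in cassandra.yaml, then start it.
sudo systemctl start cassandra

# 3. Watch the bootstrap streams; each line shows files streamed vs. total files.
watch -n 30 'nodetool netstats | grep Already'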
If you run into trouble, feel free to comment here or ask for help in the cassandra-admins channel on DataStax's Discord server.
You have already mentioned that you are aware the node has to be removed if it has been down for more than gc_grace_seconds.
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster? Is repair really not a good option here?
So that is exactly the answer. You cannot safely bring that node back if it has been down for more than gc_grace_seconds. It needs to be removed to prevent deleted data from reappearing.
https://stackoverflow.com/a/69098765/429476
From https://community.datastax.com/questions/3987/one-of-my-nodes-powered-off.html
Erick Ramirez answered May 12 2020 (edited Dec 03 2021), accepted answer:
#cache_drive If the node has been down for less than the smallest gc_grace_seconds, it should be as simple as starting Cassandra on the node then running a repair on it.
If the node has been down longer than the smallest GC grace, you will need to wipe the node clean including deleting all the contents of data/, commitlog/ and saved_caches/. Then replace the node "with itself" by adding the replace_address flag and specifying its own IP. For details, see Replacing a dead node. Cheers!
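A rough sketch of that procedure, with the IP as a placeholder and paths assuming a default package install (adjust for your layout and version):

# Stop Cassandra and wipe the node's state.
sudo systemctl stop cassandra
sudo rm -rf /var/lib/cassandra/data/* \
            /var/lib/cassandra/commitlog/* \
            /var/lib/cassandra/saved_caches/*

# Tell the node to replace itself: pass its own IP via the replace flag,
# e.g. in cassandra-env.sh (or jvm.options, depending on version).
# 10.0.0.5 is a placeholder for the node's own address.
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.5"' \
  | sudo tee -a /etc/cassandra/cassandra-env.sh

sudo systemctl start cassandra
# Once bootstrap finishes and the node shows as UN, remove the replace flag again.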
Recently I faced an issue in a customer setup with a 3-node cluster, where one node went down and came back online only after 12 days. In our scenario the gc_grace_seconds for most of the tables has been set to 1 day, and there are a lot of tables.
When this down node came up, stale data from it got replicated to the other nodes, leading to zombie data on all three nodes.
A solution I could think of was to clean the node before making it join the cluster and then run a repair, which would prevent the zombie data from appearing.
Could there be any other possible solution to this issue where I don't need to clean the node?
You should never bring a node back online if it has been down for longer than the shortest gc_grace_seconds.
This is a challenge in environments where GC grace is set to a very low value. In these situations, the procedure is to completely rebuild the node as if it was never part of the cluster:
Completely wipe all contents of data/, commitlog/ and saved_caches/.
Remove the node's IP from its seeds list if it is listed as a seed node.
Replace the node with itself using the replace_address flag.
Cheers!
We recently started having problems with our Cassandra cluster. Maybe someone has ideas on how to fix this. We're running Cassandra 3.11.7 on a 40 node cluster. We are using replication factor = 3 and read/write at consistency level QUORUM.
Recently, a single node experienced a sudden spike in CPU load which then lasted for a while. During that period, we observed a lot of dropped and queued MUTATIONs. If we restart Cassandra on the problematic node, one or two other nodes start to suffer from the same problem. We have examined log files and access patterns and have not yet been able to find the reason.
What could be the most common reasons for such behaviour? Where should we take a closer look? Has anyone already had similar experiences?
If we restart Cassandra on the problematic node, one or two other nodes start to suffer from the same problem.
First of all, when a single node presents a problem, restarting it generally achieves nothing. If anything, you'll clear the JVM heap...which will be quickly repopulated upon startup. Seriously, don't expect restarting a node to fix anything.
Has anyone already had similar experiences?
Yes, several times. For things not Cassandra related:
Are you in a cloud environment? Run iostat and look for things like high percentages of iowait and steal. Sometimes shared resources don't play well with others. If you don't have iostat, get it (yum install -y sysstat).
Check cron for all users. We once had an issue with a file integrity checker getting installed as a part of our base image, and it did exactly what you are talking about.
What could be the most common reasons for such behaviour? Where should we take a closer look?
For Cassandra related issues, I see a few possibilities:
Repairs. Check if the node is running a repair. You can see Merkle Tree calculations with nodetool compactionstats and repair streams with nodetool netstats.
Compactions. Check nodetool compactionstats. If this is it, you can try lowering your compaction throughput so that it doesn't affect normal operations.
Garbage Collection. Check the gc.log.* files. If it's GC, it can usually be fixed by reading up on and adjusting the GC settings. If there isn't anyone on your team who is a JVM GC expert, I recommend using G1GC as it removes a lot of the guesswork.
Do note that none of the things I mentioned above can be fixed with a reboot. In fact, it's likely they'll pick right back up where they left off.
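A quick, non-exhaustive way to run through those checks on the affected node (all standard tools; the gc.log path depends on your install):

# Disk/CPU pressure: high %iowait or %steal points at storage or a noisy neighbour.
iostat -x 5 3

# Repairs and compactions currently running on this node.
nodetool compactionstats
nodetool netstats

# Dropped messages since startup (MUTATION drops show up here).
nodetool tpstats

# Long GC pauses in the JVM logs.
grep -i pause /var/log/cassandra/gc.log.* | tail -20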
I encountered a consistency problem using Hector and Cassandra when we use QUORUM for both reads and writes.
I use MultigetSubSliceQuery to query rows from a super column with a limit of 100, then read them, then delete them, and then start another round.
I found that a row which should have been deleted by my prior query was still returned by the next query.
And also from a normal Column Family, I updated the value of one column from status='FALSE' to status='TRUE', and the next time I queried it, the status was still 'FALSE'.
More detail:
It does not happen every time (roughly 1 in 10,000).
The time between the two queries is around 500 ms (though we found one pair of queries with 2 seconds between them that still showed the inconsistency).
We use ntp as our cluster time synchronization solution.
We have 6 nodes, and replication factor is 3
I understand that Cassandra is supposed to be "eventually consistent", and that a read may not see a preceding write immediately. But for two seconds?! And if so, isn't it then meaningless to have QUORUM or other consistency level configurations?
So first of all, is this the correct behavior for Cassandra, and if not, what data do we need to analyze for further investigation?
After checking the source code against the system log, I found the root cause of the inconsistency.
Three factors cause the problem:
Creating and updating the same record from different nodes
Local system time is not synchronized accurately enough (although we use NTP)
Consistency level is QUORUM
Here is the problem; take the following as the event sequence:
seqID  NodeA          NodeB          NodeC
1.     New(.050)      New(.050)      New(.050)
2.     Delete(.030)   Delete(.030)
First, the Create request comes from Node C with local timestamp 00:00:00.050; assume the request is first recorded on Node A and Node B, and later synchronized to Node C.
Then the Delete request comes from Node A with local timestamp 00:00:00.030 and is recorded on Node A and Node B.
When a read request comes in, Cassandra does a version-conflict merge, but the merge depends only on the timestamp. So although the Delete happened after the Create, the final merged result is "New", which carries the later timestamp because of the local clock synchronization issue.
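To make the merge rule concrete, here is a small hypothetical illustration using cqlsh rather than the original Hector/Thrift client (keyspace, table and values are invented): a delete carrying an older timestamp than the insert is silently ignored.

cqlsh -e "CREATE KEYSPACE IF NOT EXISTS ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
cqlsh -e "CREATE TABLE IF NOT EXISTS ks.t (id int PRIMARY KEY, status text);"
cqlsh -e "INSERT INTO ks.t (id, status) VALUES (1, 'NEW') USING TIMESTAMP 50;"
cqlsh -e "DELETE FROM ks.t USING TIMESTAMP 30 WHERE id = 1;"
# The row still comes back as 'NEW' because the delete's timestamp (30) is
# older than the insert's (50), so the merge keeps the insert.
cqlsh -e "SELECT id, status, writetime(status) FROM ks.t WHERE id = 1;"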
I also faced a similar issue. It occurred because the Cassandra driver used the server timestamp by default to decide which query is latest. However, in recent versions of the driver they have changed this, and by default they now use the client timestamp.
I have described the details of the issue here.
The deleted rows may be showing up as "range ghosts" because of the way that distributed deletes work: see http://wiki.apache.org/cassandra/FAQ#range_ghosts
If you are reading and writing individual columns both at CL_QUORUM, then you should always get full consistency, regardless of the time interval (provided strict ordering is still observed, i.e. you are certain that the read is always after the write). If you are not seeing this, then something, somewhere, is wrong.
To start with, I'd suggest a) verifying that the clients are syncing to NTP properly, and/or reproduce the problem with times cross-checked between clients somehow, and b) maybe try to reproduce the problem with CL_ALL.
Another thought - are your clients synced with NTP, or just the Cassandra server nodes? Remember that Cassandra uses the client timestamps to determine which value is the most recent.
I'm running into this problem with one of my clients/nodes. The other 2 clients I'm testing with (and 2 other nodes) run smoothly. I have a test that uses QUORUM for all reads and all writes, and it fails very quickly. Some processes do not see anything from the others, and others may still see data even after I remove it at QUORUM.
In my case I turned on the logs and intended to watch them with the tail -F command:
tail -F /var/lib/cassandra/log/system.log
to see whether I was getting some errors as presented here. To my surprise the tail process itself returned an error:
tail: inotify cannot be used, reverting to polling: Too many open files
and according to another thread, this means that some processes will fail to open files. In other words, the Cassandra node is likely not responding as expected because it cannot properly access data on disk.
I'm not too sure whether this is related to the problem the original poster ran into, but tail -F is certainly a good way to determine whether the open-file limit has been reached.
(FYI, I have 5 relatively heavy servers running on the same machine, so I'm not too surprised. I'll have to look into increasing the ulimit. I'll report here again if I get it fixed this way.)
More info about the file limit and the ulimit command line option: https://askubuntu.com/questions/181215/too-many-open-files-how-to-find-the-culprit
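A quick way to check whether the Cassandra process is actually near its descriptor limit (the pgrep pattern is an assumption about how the daemon appears in the process list; run as root or the cassandra user):

CASS_PID=$(pgrep -f CassandraDaemon | head -n 1)
echo "open file descriptors: $(sudo ls /proc/$CASS_PID/fd | wc -l)"
sudo grep 'Max open files' /proc/$CASS_PID/limits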
--------- Update 1
Just in case, I first tested using Java 1.7.0-11 from Oracle (as mentioned below, I first tried a limit of 3,000 without success). The same error would pop up at about the same point when running my Cassandra test (plus, even with a ulimit of 3,000, the tail -F error would still appear...).
--------- Update 2
Okay! That worked. I changed the ulimit to 32,768 and the problems are gone. Note that I had to enlarge the per-user limit in /etc/security/limits.conf and run sudo sysctl -p before I could bump the maximum to such a high number. Somehow the default upper limit of 3,000 was not enough, even though the old limit was only 1,024.
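For reference, the kind of changes involved (the values and the user name "cassandra" are examples, not recommendations):

# /etc/security/limits.conf -- raise the open-file limit for the user running Cassandra;
# takes effect on the next login / service restart.
#   cassandra  soft  nofile  32768
#   cassandra  hard  nofile  32768

# If the system-wide ceiling also needs raising, set fs.file-max in /etc/sysctl.conf
# and reload it:
sudo sysctl -p

# Verify the limit seen by a new shell for that user:
sudo -u cassandra bash -c 'ulimit -n'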