I have a single-node Cassandra server. It had been working well for a long time, until a server restart I did yesterday.
Now nodetool status gives me the following error:
error: null
-- StackTrace --
java.lang.ClassCastException
Cassandra itself seems to be running. A subset of the logs:
...
INFO 03:58:18 Node /10.0.0.4 state jump to NORMAL
INFO 03:58:18 Waiting for gossip to settle before accepting client requests...
INFO 03:58:26 No gossip backlog; proceeding
The release version is 3.7.
I haven't been able to crack this. I'd be very thankful for any help.
Let me know if I can provide any more useful information.
I have Hadoop 3.2.2 running on a cluster with 1 name node, 2 data nodes, and 1 resource manager node. I tried to run the SparkPi example in cluster mode. The spark-submit is done from my local machine. YARN accepts the job, but the application UI says this. Further, in the terminal where I submitted the job, it says:
2021-06-05 13:10:03,881 INFO yarn.Client: Application report for application_1622897708349_0001 (state: ACCEPTED)
This continues to print until it fails. Upon failure it prints
I tried increasing spark.executor.heartbeatInterval to 3600 seconds, with no luck. I also tried running the job from the name node, thinking there must be some connection issue with my local machine. I'm still unable to run it.
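For context, the submission looks roughly like this (a sketch; the examples jar path and the final iterations argument are placeholders for my actual values):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.heartbeatInterval=3600s \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 1000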
I found the answer, although I don't know why it works: adding the private IP address to the security group in AWS did the trick.
We use vnodes on our cluster.
I noticed that when the token space of a node changes (automatically on vnodes, during a repair, or during a cleanup after adding new nodes), the DataStax Node.js driver gets a lot of "Operation timed out - received only X responses" errors for a few minutes.
I tried using ONE and LOCAL_QUORUM consistencies.
I suppose this is due to the coordinator not hitting the right node just after the move. This seems to be a logical behavior (data was moved) but we really want to address this particular issue.
What do you suggest we do to avoid this? A custom retry policy? Caching? Changing the consistency?
Example of behavior
When we see this:
4/7/2016, 10:43am Info Host 172.31.34.155 moved from '8185241953623605265' to '-1108852503760494577'
We see a spike of these:
{
  "message": "Operation timed out - received only 0 responses.",
  "info": "Represents an error message from the server",
  "code": 4608,
  "consistencies": 1,
  "received": 0,
  "blockFor": 1,
  "isDataPresent": 0,
  "coordinator": "172.31.34.155:9042",
  "query": "SELECT foo FROM foo_bar LIMIT 10"
}
In fact, when adding a new node, there will be token range movement, but Cassandra can still serve read requests using the old token ranges until the scale-out has finished completely. So the behavior you're facing is very suspicious.
If you can reproduce this error, please activate query tracing to narrow down the issue.
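For example, you can sample a small fraction of all requests cluster-wide, or trace just your own session from cqlsh (traces land in the system_traces keyspace):
nodetool settraceprobability 0.001
TRACING ON;   -- run inside cqlsh to trace only the current session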
The error can also be related to a node under heavy load that is not replying fast enough.
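If you do end up trying the custom retry policy you mentioned, a minimal sketch with the Node.js driver could look like the following. It assumes the driver's documented RetryPolicy interface (onReadTimeout / retryResult / rejectResult); the policy name and data center are placeholders:
const cassandra = require('cassandra-driver');

// Hypothetical policy: retry a read timeout once on the next host in the query plan
// instead of failing immediately on the coordinator that just took over the range.
class MoveTolerantRetryPolicy extends cassandra.policies.retry.RetryPolicy {
  onReadTimeout(info, consistency, received, blockFor, isDataPresent) {
    return info.nbRetry === 0
      ? this.retryResult(consistency, false) // false = do not reuse the same coordinator
      : this.rejectResult();
  }
}

const client = new cassandra.Client({
  contactPoints: ['172.31.34.155'],
  localDataCenter: 'datacenter1', // placeholder: use your real DC name
  policies: { retry: new MoveTolerantRetryPolicy() }
});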
I have a 3-node Cassandra cluster (replication factor set to 2) with Solr installed; each node runs RHEL with 32 GB RAM, a 1 TB HDD, and DSE 4.8.3. There are lots of writes happening on my nodes, and my web application also reads from them.
I have observed that all the nodes go down every 3-4 days. I have to restart every node, and then they function well for the next 3-4 days before the same problem repeats. I checked the server logs, but they do not show any errors even when a server goes down. I am unable to figure out why this is happening.
In my application, sometimes when I connect to the nodes through the C# Cassandra driver, I get the following error:
Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: 'node-ip':9042)
  at Cassandra.Tasks.TaskHelper.WaitToComplete(Task task, Int32 timeout)
  at Cassandra.Tasks.TaskHelper.WaitToComplete[T](Task`1 task, Int32 timeout)
  at Cassandra.ControlConnection.Init()
  at Cassandra.Cluster.Init()
But when I check OpsCenter, none of the nodes are down; all node statuses look perfectly fine. Could this be a problem with the driver? Earlier I was using Cassandra C# driver version 2.5.0 installed from NuGet, but I have since updated it to version 3.0.3 and the error still persists.
Any help on this would be appreciated. Thanks in advance.
If you haven't done so already, you may want to set your logging level to DEBUG by running nodetool -h 192.168.XXX.XXX setlogginglevel org.apache.cassandra DEBUG on all your nodes.
Your first issue is most likely an OutOfMemory exception.
For your second issue, the problem is most likely that you have really long GC pauses. Tailing /var/log/cassandra/debug.log or /var/log/cassandra/system.log may give you a hint but typically doesn't reveal the problem unless you are meticulously looking at the timestamps. The best way to troubleshoot this is to ensure you have GC logging enabled in your jvm.options config and then tail your gc logs taking note of the pause times:
grep 'Total time for which application threads were stopped:' /var/log/cassandra/gc.log.1 | less
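If GC logging is not already enabled, these are the typical Java 8-style flags to turn it on (a sketch; on older Cassandra/DSE releases they go in cassandra-env.sh rather than jvm.options):
-Xloggc:/var/log/cassandra/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M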
The "Unexpected exception during request; channel = [....] java.io.IOException: Error while read (....): Connection reset by peer" error typically indicates inter-node timeouts, i.e. the coordinator times out waiting for a response from another node and sends a TCP RST packet to close the connection.
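For reference, the server-side timeouts that govern this live in cassandra.yaml. Raising them only papers over the underlying slowness, but it is worth comparing your settings against the usual 2.x/3.x defaults (shown below):
read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
range_request_timeout_in_ms: 10000
request_timeout_in_ms: 10000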
A node went down while I was bootstrapping a new node, and the bootstrap failed. The bootstrapping node shut down, leaving the following messages in its log:
INFO [main] 2015-02-07 06:03:32,761 StorageService.java:1025 - JOINING: Starting to bootstrap...
ERROR [main] 2015-02-07 06:03:32,799 CassandraDaemon.java:465 - Exception encountered during startup
java.lang.RuntimeException: A node required to move the data consistently is down (/10.0.3.56). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
How do I recover the situation? Can I restart the bootstrap process once the failed node is back online? Or do I need to revert the partial bootstrap and try again somehow?
I have tracked down the original cause. The new node was able to connect to the node at 10.0.3.56, but 10.0.3.56 was not able to open connections back to the new node. 10.0.3.56 contained the only copy of some data that needed to be moved to the new node (replication factor == 1), but its attempts to send the data were blocked.
Since this involves a data move, not just replication, and based on the place in the code where the exception is thrown, I assume you are trying to replace a dead node as described here: http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
By the look of it, the node did not get as far as joining the ring. You can certainly double-check by running nodetool status to see whether the node has joined at all.
If not, then you can simply delete everything from the data, commitlog and saved_caches directories (as sketched below) and restart the process. What was wrong with that 10.0.3.56 node?
If this node has joined the ring, then it should still be safe to simply restart it once you bring node 10.0.3.56 back up.
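In the "has not joined" case, clearing the partial state looks roughly like this (a sketch assuming the default package install paths; stop the node first and adjust the paths to your data_file_directories, commitlog_directory and saved_caches_directory settings):
sudo service cassandra stop                      # stop the half-bootstrapped node
sudo rm -rf /var/lib/cassandra/data/*            # default data_file_directories
sudo rm -rf /var/lib/cassandra/commitlog/*       # default commitlog_directory
sudo rm -rf /var/lib/cassandra/saved_caches/*    # default saved_caches_directory
sudo service cassandra start                     # bootstrap starts over from scratch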
While trying to use Cassandra version 2.0.1, I started facing a version handshaking problem.
There was an exception from OutboundTcpConnection.java stating that handshaking is not possible with a particular node.
I had a look at the TCP dump and ruled out any problem in the network layer.
The application is not completing the handshaking process. Moreover, port 7000 is still active.
For example, all my 8 nodes are up, but when I try a nodetool status, some nodes show a DN (down node) status. Later, after examining the machine, the TCP backlog queue was found to be overflowing, and that particular server had stopped listening for the other servers in the cluster.
I am still not able to spot the root cause of this problem.
Note: I had tried a previous version of Cassandra, 1.2.4, and it was working fine at that time. Before going to production, I thought it would be better to move to the 2.0.x version, mainly to avoid migration overhead later. Can anyone provide an idea on this?
The exception I am getting is:
INFO [HANDSHAKE-/aa.bb.cc.XX] 2013-10-03 17:36:16,948 OutboundTcpConnection.java (line 385) Handshaking version with /aa.bb.cc.XX
INFO [HANDSHAKE-/aa.bb.cc.YY] 2013-10-03 17:36:17,280 OutboundTcpConnection.java (line 396) Cannot handshake version with /aa.bb.cc.YY
This sounds like https://issues.apache.org/jira/browse/CASSANDRA-6349. You should upgrade.