I have a 3-node cluster with low load. Any write/read attempts to Cassandra are timing out. 'nodetool status' shows everything up; however, 'nodetool describecluster' shows the other nodes as UNREACHABLE (not because of a schema mismatch, since no schema version is listed next to the unreachable nodes.)
# nodetool describecluster
Cluster Information:
Name: ------
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
8b7c6bca-f4f8-3d49-a4cc-64ec69bf8573: [10.65.221.36]
UNREACHABLE: [10.65.221.20, 10.65.221.4]
cqlsh command is also timing out (despite increasing the timeout).
I see the NTR (Native-Transport-Requests) all-time-blocked count staying high. There are no error messages in the Cassandra logs either. 'nodetool netstats' shows a lot of small messages with high values in Pending and Completed. I'm not sure what the small messages imply. Any suggestions on how to debug this further?
It seems like a port issue; check whether port 9042 is open.
Run this to check open ports:
netstat -na|grep LISTEN
You can have a look at this link for more information: https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/secureFireWall.html
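As a rough sketch (the IPs below are taken from the describecluster output above, and assume you run the check from the node that still shows a schema version), you can also verify that the inter-node and CQL ports are actually reachable between nodes:
nc -zv 10.65.221.20 7000   # inter-node/gossip (storage) port
nc -zv 10.65.221.4 7000
nc -zv 10.65.221.20 9042   # native transport (CQL) port
If nc reports the connection refused or timing out while the port shows up under LISTEN locally, a firewall or a wrong listen_address/rpc_address binding is the usual culprit.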
I am observing timeouts in the Cassandra cluster with the following logs in debug.log:
time 1478 msec - slow timeout 500 msec/cross-node
Does this mean the read request is spending 1378 ms waiting for the other replicas to respond?
NTP is in sync across this cluster, which holds relatively little data and has adequate CPU and memory allocated.
Is setting cross_node_timeout: true going to help?
Cassandra version: 3.11.6
Thanks
The value 1478 msec reported in the log is the time recorded for that particular query to execute. The cross-node part signifies that the query/operation was performed across nodes. This is just a warning that your queries are running slower than expected. The default slow-query threshold is 500 ms and can be changed in cassandra.yaml via slow_query_log_timeout_in_ms.
If this is a one-off entry in your logs, it could have been caused by GC. If it shows up consistently, then something is wrong in your environment (network, etc.) or with your query.
Regarding the property cross_node_timeout: true, it was introduced via CASSANDRA-4812. Its purpose is to avoid spurious timeouts when clocks are not NTP-synced across nodes. Its default value is false. Since NTP is synced on your cluster, you can set it to true, but it will not help with the message you are getting.
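For reference, both settings discussed above live in cassandra.yaml; a minimal sketch of what they would look like (500 ms is the shipped default, and the path is the usual package-install location, so adjust for your environment):
# /etc/cassandra/cassandra.yaml (path assumed; values illustrative)
slow_query_log_timeout_in_ms: 500   # queries slower than this are logged as slow
cross_node_timeout: true            # only meaningful when clocks are NTP-synced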
I'm running a cluster of 10 Cassandra 3.10 nodes and I see very strange behavior: after a restart, a node does not immediately open the native_transport_port (9042).
After a node restart, the flow is:
the node finishes replaying the commitlog,
it updates all its data,
it becomes visible to the other nodes in the cluster,
it waits a random amount of time (from 1 minute to hours) before opening port 9042
My logs are in DEBUG mode, and nothing is written about opening this port.
What is happening and how can I debug this problem?
Output from several nodetool commands:
nodetool enablebinary: does not return at all
nodetool compactionstats: 0 pending tasks
nodetool netstats: Mode: STARTING. Not sending any streams.
nodetool info:
Gossip active : true
Thrift active : false
Native Transport active: false
Thank you.
Are you saving your key/row cache? Reloading saved caches on startup tends to take a long time when that is the case. Also, what is your max open-files limit?
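A quick way to check both (a rough sketch; the config path is an assumption for a typical package install, and it assumes a single Cassandra process on the host):
# cache save settings in cassandra.yaml (0 means the cache is not saved to disk)
grep -E 'key_cache_save_period|row_cache_save_period' /etc/cassandra/cassandra.yaml
# open-files limit of the running Cassandra process
cat /proc/$(pgrep -f CassandraDaemon)/limits | grep -i 'open files'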
I have a 3-node Cassandra cluster (replication factor set to 2) with Solr installed; each node has RHEL, 32 GB RAM, a 1 TB HDD, and DSE 4.8.3. There are lots of writes happening on my nodes, and my web application also reads from them.
I have observed that all the nodes go down every 3-4 days. I have to restart every node, and then they function quite well until the next 3-4 days, when the same problem repeats. I checked the server logs, but they do not show any error even when the server goes down. I am unable to figure out why this is happening.
In my application, sometimes when I connect to the nodes through the C# Cassandra driver, I get the following error:
Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: 'node-ip':9042) at Cassandra.Tasks.TaskHelper.WaitToComplete(Task task, Int32 timeout) at Cassandra.Tasks.TaskHelper.WaitToComplete[T](Task`1 task, Int32 timeout) at Cassandra.ControlConnection.Init() at Cassandra.Cluster.Init()
But when I check OpsCenter, none of the nodes are down; all node statuses look perfectly fine. Could this be a problem with the driver? Earlier I was using Cassandra C# driver version 2.5.0 installed from NuGet, but I have since updated to version 3.0.3 and the error still persists.
Any help on this would be appreciated. Thanks in advance.
If you haven't done so already, you may want to turn up your logging levels by running nodetool -h 192.168.XXX.XXX setlogginglevel org.apache.cassandra DEBUG on all your nodes.
Your first issue is most likely an OutOfMemory Exception.
For your second issue, the problem is most likely really long GC pauses. Tailing /var/log/cassandra/debug.log or /var/log/cassandra/system.log may give you a hint, but it typically doesn't reveal the problem unless you are meticulously looking at the timestamps. The best way to troubleshoot this is to ensure you have GC logging enabled in your jvm.options config, then tail your GC logs and take note of the pause times:
grep 'Total time for which application threads were stopped:' /var/log/cassandra/gc.log.1 | less
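If GC logging isn't enabled yet, the flags below are roughly what you would uncomment (they ship commented out in the stock Cassandra 3.x jvm.options for Java 8; the log path is the packaged default, adjust to taste):
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/cassandra/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M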
The Unexpected exception during request; channel = [....] java.io.IOException: Error while read (....): Connection reset by peer error typically indicates inter-node timeouts, i.e. the coordinator times out waiting for a response from another node and the connection is closed with a TCP RST packet.
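For reference, the timeouts involved are governed by these cassandra.yaml settings (the values shown are the shipped defaults, in milliseconds); raising them usually only masks the underlying GC or load problem rather than fixing it:
read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
range_request_timeout_in_ms: 10000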
I'm trying, but failing, to join a new (well, old, but wiped) node to an existing cluster.
Currently the cluster consists of 2 nodes and runs C* 2.1.2. I start a third node with 2.1.2; it gets into the joining state and bootstraps, i.e. streams some data as shown by nodetool netstats, but after some time it gets stuck. From that point nothing gets streamed and the new node stays in the joining state. I restarted the node twice; every time it streamed more data but then got stuck again. (I'm currently on a third round like that.)
Other facts:
I don't see any errors in the log on any of the nodes.
Connectivity seems fine; I can ping and netcat to port 7000 in all directions.
I have 267 GB of load per running node, replication factor 2, 16 tokens.
The load of the new node is around 100 GB now.
I'm guessing that after a few rounds of restarts the node will finally pull in all of the data from the running nodes and join the cluster, but that is definitely not the way it should work.
EDIT: I discovered some more info:
The bootstrapping process stops in the middle of streaming some table, always after sending exactly 10MB of some SSTable, e.g.:
$ nodetool netstats | grep -P -v "bytes\(100"
Mode: NORMAL
Bootstrap e0abc160-7ca8-11e4-9bc2-cf6aed12690e
/192.168.200.16
Sending 516 files, 124933333900 bytes total
/home/data/cassandra/data/leadbullet/page_view-2a2410103f4411e4a266db7096512b05/leadbullet-page_view-ka-13890-Data.db 10485760/167797071 bytes(6%) sent to idx:0/192.168.200.16
Read Repair Statistics:
Attempted: 2016371
Mismatch (Blocking): 0
Mismatch (Background): 168721
Pool Name Active Pending Completed
Commands n/a 0 55802918
Responses n/a 0 425963
I can't diagnose the error and I'd be grateful for any help!
Try to telnet from one node to another using the correct port (a quick sketch of this check follows the list below).
Make sure you are joining a cluster with the correct name.
Try using: nodetool repair
You might be pinging the external IP addresses, while your cluster communicates using internal IP addresses.
If you are running on Amazon AWS, make sure the firewall is open on both internal IP addresses.
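A minimal sketch of those checks, using the storage port and the peer IP from the netstats output above (the cassandra.yaml path is assumed for a package install):
# can we reach the inter-node (storage) port on the peer?
telnet 192.168.200.16 7000
# which cluster name and addresses is this node actually configured with?
grep -E 'cluster_name|listen_address|broadcast_address|seeds' /etc/cassandra/cassandra.yaml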
We have a Cassandra cluster of 3 nodes. Yesterday I stopped one of the nodes and started it again today. Surprisingly, the restarted node now shows up in a different ring. Why is it showing a different ring? There are no error messages in the logs.
ring 1: nodetool status
UN 1.2.3.4
UN 5.6.7.8
ring 2: nodetool status
UN 9.10.11.12
When I look at the logs of ring 1, both nodes show the same message:
WARN [WRITE-/9.10.11.12] 2013-11-05 14:04:51,221 SSLFactory.java (line 139) Filtering out TLS_RSA_WITH_AES_256_CBC_SHA as it isnt supported by the socket
Ring 2:
It has no errors
Both cluster names are the same, both rings are in the same network, and all three nodes are seed nodes. Any help would be appreciated.
Just a guess, but this may be related to How to remove a node from gossip in Cassandra, in that you may have a ring in a bad state.
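One way to check whether gossip state is the problem (just a sketch; run these on each node and compare the results):
nodetool gossipinfo
nodetool describecluster
If a node shows stale or conflicting endpoint state in gossipinfo, that points towards the bad ring state mentioned above.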
I faced a similar issue; clearing the Cassandra cache (saved_caches) and hints directories before starting the node again solved it for me.
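A rough sketch of that cleanup, assuming the default package-install data paths under /var/lib/cassandra and a service-managed install; note that on versions before 3.0 hints live in the system.hints table rather than in a directory:
sudo service cassandra stop
rm -rf /var/lib/cassandra/saved_caches/*
rm -rf /var/lib/cassandra/hints/*      # hints directory only exists on Cassandra 3.0+
sudo service cassandra start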