Cassandra PasswordAuthenticator causing timeout

I have a Cassandra cluster with 3 nodes and want to enable PasswordAuthenticator. I have made the following changes in cassandra.yaml:
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
role_manager: CassandraRoleManager
roles_validity_in_ms: 60000
roles_update_interval_in_ms: 60000
permissions_validity_in_ms: 60000
permissions_update_interval_in_ms: 60000
credentials_validity_in_ms: 60000
credentials_update_interval_in_ms: 60000
I have increased the validity intervals to 60 seconds because there won't be frequent changes in roles.
Now, when I restart a Cassandra node, the node sometimes connects to the client successfully, but after a few seconds it starts giving "Connection timeout" errors. Also, the CPU load increases to 100%. I have attached both screenshots.
During this time, if I run nodetool status, all 3 nodes are shown as UN, and service cassandra status also shows the service as active.
Note: I have not enabled PasswordAuthenticator on all nodes. I only tried it on one node, and it started giving timeout errors on connection requests.
UPDATE: Tried enabling it on all nodes but am still getting the same error.

You are getting the connection timeout most likely because your app has hit a node which doesn't have authentication enabled.
You need to enable authentication on ALL nodes, or your test is not going to be valid. You are not using the feature as it is designed, so it shouldn't be a surprise that it is not working as expected. Cheers!
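If the timeouts persist after authentication is enabled everywhere, one thing worth checking is the replication of the system_auth keyspace: with its default replication factor of 1, every login depends on the single node that holds the role data. This is only a hedged sketch, not a confirmed fix for this case; it assumes cqlsh access with the default cassandra superuser, and the strategy and replication factor should match your own topology.
# Raise replication of the keyspace that stores roles and credentials (adjust to your topology).
cqlsh -u cassandra -p cassandra -e "ALTER KEYSPACE system_auth WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
# Then run a repair of system_auth on every node so all replicas hold the role data.
nodetool repair system_auth
Because the question sets the validity intervals to 60 seconds, any change to roles or permissions can take up to a minute to be picked up by the caches.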

Related

Cassandra - Frequent cross-node timeouts

I am observing timeouts in the Cassandra cluster with the following logs in debug.log:
time 1478 msec - slow timeout 500 msec/cross-node
Does this mean the read request spent 1378 ms waiting for the other replicas to respond?
NTP is in sync across this cluster, which holds little data and has ample CPU and memory allocated.
Is setting cross_node_timeout: true going to help?
Cassandra version: 3.11.6
Thanks
The value 1478 msec reported in the logs is the time recorded for that particular query to execute. The cross-node label signifies that the query/operation was performed across nodes. This is just a warning that your queries are running slowly; the default slow-query threshold is 500 ms and can be changed in cassandra.yaml via slow_query_log_timeout_in_ms.
If this is a one-off entry in your logs, it could have been caused by GC. If it shows up consistently, then something is wrong in your environment (network, etc.) or with your query.
Regarding the property cross_node_timeout: true, it was introduced via CASSANDRA-4812. When it is enabled, replicas take into account the time a request already spent in transit from the coordinator when deciding whether to drop it, which is why it should only be enabled when clocks are kept in sync with NTP; its default value is false. Since NTP is synced on your cluster you can set it to true, but it will not make the message you are getting go away.
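For reference, a minimal sketch of inspecting the two settings discussed above; the cassandra.yaml path is assumed for a package install, and both settings require a node restart to take effect.
# Show the current values (the path may differ on your install).
grep -E "slow_query_log_timeout_in_ms|cross_node_timeout" /etc/cassandra/cassandra.yaml
# On a default 3.11 config this typically prints something like:
#   slow_query_log_timeout_in_ms: 500
#   # cross_node_timeout: false
Raising slow_query_log_timeout_in_ms only silences the warning; it does not make the underlying queries any faster.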

Cassandra 2.2.8 cluster timeout exceptions

I have a 3-node cluster with low load. Any write/read attempts to Cassandra are timing out. 'nodetool status' shows everything up; however, 'nodetool describecluster' shows the other nodes as UNREACHABLE (not because of a schema mismatch, since no schema version is listed next to the unreachable nodes).
# nodetool describecluster
Cluster Information:
Name: ------
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
8b7c6bca-f4f8-3d49-a4cc-64ec69bf8573: [10.65.221.36]
UNREACHABLE: [10.65.221.20, 10.65.221.4]
The cqlsh command is also timing out (despite increasing the timeout).
I see the NTR (Native-Transport-Requests) all-time-blocked count staying high. There are no error messages in the Cassandra logs either. 'nodetool netstats' shows a lot of small messages with high values in pending and completed. I am not sure what the small messages imply. Any suggestions on how to debug this further?
It seems like a port issue; check whether port 9042 is open.
Run this to list listening ports:
netstat -na | grep LISTEN
You can have a look at this link for more information: https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/secureFireWall.html
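To make that check more targeted, here is a rough sketch using netcat from a client machine or a peer node, with the addresses taken from the describecluster output above (assuming a netcat build that supports -z):
# CQL client port on the reachable node and on the two unreachable ones.
nc -zv 10.65.221.36 9042
nc -zv 10.65.221.20 9042
nc -zv 10.65.221.4 9042
# Inter-node communication uses port 7000 (7001 when SSL is enabled), so check it between nodes too.
nc -zv 10.65.221.20 7000
If the ports are open but NTR stays blocked, the problem is more likely overload on the native transport than a firewall.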

com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response

We are using Apache Cassandra v3.0.9 with com.datastax.cassandra:cassandra-driver-core:3.1.3. Our application works fine most of the time, but about once a week we start getting the following exception from our applications:
com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.ChainedResultSetFuture.getUninterruptibly(ChainedResultSetFuture.java:62)
at com.datastax.driver.core.NewRelicChainedResultSetFuture.getUninterruptibly(NewRelicChainedResultSetFuture.java:11)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
at com.til.cms.graphdao.cassandra.dao.generic.CassandraICMSGenericDaoImpl.getCmsEntityMapForLimitedSize(CassandraICMSGenericDaoImpl.java:2824)
.....
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:770)
at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
These applications hit the Cassandra datacenter for read requests. The datacenter consists of 5 physical servers, each with 2 disks, 64 GB RAM, 40 cores, and a 16 GB heap with G1 GC.
As per our investigation there was no problem on the Cassandra servers themselves: no increase in load average or iowait, no GC pauses, and no nodetool/cqlsh connectivity issues. We just kept getting these exceptions in our application logs until we restarted the Cassandra servers. The exception was reported randomly for different Cassandra servers in the datacenter, and we had to restart each of them. Normally each of these Cassandra servers handles 10K read requests/second and hardly 10 write requests/second. When we encounter this problem, read throughput drops dramatically to 2-3K/second.
The replication factor of our Cassandra datacenter is 3, and this is how we are building the connection:
Cluster.builder()
.addContactPoints(nodes)
.withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
.withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder().withLocalDc(localDatacenter).build())
.withSpeculativeExecutionPolicy(PerHostPercentileTracker.builder(13000).build())
.build()
EDIT:
We have observed that before we start getting these exceptions, we get the following WARN-level messages in our Java application:
2018-04-03 23:40:06,456 WARN [cluster1-timeouter-0] com.datastax.driver.core.RequestHandler [RequestHandler.java:805] Not retrying statement because it is not idempotent (this message will be logged only once). Note that this version of the driver changes the default retry behavior for non-idempotent statements: they won't be automatically retried anymore. The driver marks statements non-idempotent by default, so you should explicitly call setIdempotent(true) if your statements are safe to retry. See https://docs.datastax.com/en/developer/java-driver/3.1/manual/retries/ for more details.
2018-04-04 00:04:24,856 WARN [cluster1-nio-worker-2] com.datastax.driver.core.PercentileTracker [PercentileTracker.java:108] Got request with latency of 16632 ms, which exceeds the configured maximum trackable value 13000
2018-04-04 00:04:24,858 WARN [cluster1-timeouter-0] com.datastax.driver.core.PercentileTracker [PercentileTracker.java:108] Got request with latency of 16712 ms, which exceeds the configured maximum trackable value 13000

Cassandra opens native_transport_port only after a random time

I'm running a cluster of 10 Cassandra 3.10 nodes and I saw very strange behavior: after a restart, a node won't immediately open the native_transport_port (9042).
After one node restarts, the flow is:
the node finishes reading all commitlogs,
updates all its data,
becomes visible to the other nodes in the cluster,
then waits a random time (from 1 minute to hours) before opening port 9042.
My logs are in DEBUG mode, and nothing is written about opening this port.
What is happening and how can I debug this problem?
Output from several nodetool commands:
nodetool enablebinary does not return at all
nodetool compactionstats: 0 pending tasks
nodetool netstats: Mode: STARTING. Not sending any streams.
nodetool info: Gossip active: true
Thrift active: false
Native Transport active: false
Thank you.
Are you saving your key/row cache? Startup tends to take a long time when that is the case. Also, what is your open-file (file-max) limit?
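A rough way to check both of those things on the affected node (paths assumed for a package install on Linux):
# Is the node configured to save its key/row caches? (a save period of 0 means "don't save")
grep -E "key_cache_save_period|row_cache_save_period|saved_caches_directory" /etc/cassandra/cassandra.yaml
# Cache sizes on the running node.
nodetool info | grep -i cache
# Open-file limit of the running Cassandra process (assumes a single CassandraDaemon process).
cat /proc/$(pgrep -f CassandraDaemon)/limits | grep -i "open files"
A very large saved key/row cache is re-read during startup, which matches the symptom of the binary port opening long after the commitlog replay finishes.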

Cassandra 2.1.2 node stuck on joining the cluster

I'm trying but failing to join a new (well old, but wiped out) node to an existing cluster.
Currently the cluster consists of 2 nodes and runs C* 2.1.2. I start a third node with 2.1.2; it gets to the joining state and bootstraps, i.e. streams some data as shown by nodetool netstats, but after some time it gets stuck. From that point nothing gets streamed and the new node stays in the joining state. I restarted the node twice; every time it streamed more data but then got stuck again. (I'm currently on a third round like that.)
Other facts:
I don't see any errors in the log on any of the nodes.
Connectivity seems fine; I can ping and netcat to port 7000 in all directions.
I have 267 GB load per running node, replication 2, 16 tokens.
The load of the new node is around 100 GB now.
I'm guessing that after a few rounds of restarts the node will finally pull in all of the data from the running nodes and join the cluster, but that's definitely not the way it should work.
EDIT: I discovered some more info:
The bootstrapping process stops in the middle of streaming some table, always after sending exactly 10 MB of an SSTable, e.g.:
$ nodetool netstats | grep -P -v "bytes\(100"
Mode: NORMAL
Bootstrap e0abc160-7ca8-11e4-9bc2-cf6aed12690e
/192.168.200.16
Sending 516 files, 124933333900 bytes total
/home/data/cassandra/data/leadbullet/page_view-2a2410103f4411e4a266db7096512b05/leadbullet-page_view-ka-13890-Data.db 10485760/167797071 bytes(6%) sent to idx:0/192.168.200.16
Read Repair Statistics:
Attempted: 2016371
Mismatch (Blocking): 0
Mismatch (Background): 168721
Pool Name Active Pending Completed
Commands n/a 0 55802918
Responses n/a 0 425963
I can't diagnose the problem and I'll be grateful for any help!
Try to telnet from one node to another using the correct port.
Make sure you are joining the cluster with the correct name.
Try using: nodetool repair
You might be pinging the external IP addresses, while your cluster communicates using internal IP addresses.
If you are running on Amazon AWS, make sure the firewall is open for the internal IP addresses of both nodes.
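A minimal sketch of the first couple of checks, run from the joining node; the address comes from the netstats output above and the cassandra.yaml path is assumed for a package install:
# Storage/streaming port towards an existing node.
telnet 192.168.200.16 7000
# The cluster name and partitioner must match on every node.
nodetool describecluster
grep "^cluster_name" /etc/cassandra/cassandra.yaml
If the telnet connection hangs or is refused, check the firewall rules between the nodes.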
