Why is the connection frequently timing out - Cassandra

As a follow-up to this question: Enable one time Cassandra Authentication and Authorization check and cache it forever
I would like to understand why I get a Request timed out error. When I check the server logs, I see only the following error:
ERROR [SharedPool-Worker-34] 2018-06-01 10:40:36,589 ErrorMessage.java:338 - Unexpected exception during request
java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:489) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:269) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.RolesCache.getRoles(RolesCache.java:66) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:51) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) ~[apache-cassandra-3.0.8.jar:3.0.8]
I understand that I didn't enable caching of authentication and authorization in cassandra.yaml, but could somebody explain why I get this error so frequently? Is authentication a costly operation in Cassandra?

If you're using the default Cassandra user, the auth query is a normal query at QUORUM; any other user should be using LOCAL_ONE. So in terms of "operation cost" there is nothing abnormal. But the error message (this part specifically: "Operation timed out - received only 0 responses.") means you probably have overloaded nodes that can't respond to your queries.
A quick look at your nodes using nodetool tpstats will show whether you're having problems serving your reads (look for blocked, all time blocked and/or DROPPED reads).
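For example, a quick check could look like this (the exact output columns vary slightly by version; ReadStage and the Dropped Messages section are what matter here):
nodetool tpstats
# In the thread pool table, watch the ReadStage row: a non-zero "Blocked" or
# "All time blocked" count means read requests are queuing up.
# In the "Dropped Messages" section, a growing READ counter means the node is
# shedding reads because it cannot keep up.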
Auth queries are issued with every query you run (AFAIK), so you should enable the caches for them (and avoid overloading your cluster).
Relevant documentation: https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/secureConfigNativeAuth.html
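For reference, on Cassandra 3.0 the role and permission caches are controlled by these cassandra.yaml settings (the values below are only illustrative; pick what your security policy allows):
# cassandra.yaml - how long fetched roles/permissions may be served from cache
roles_validity_in_ms: 120000
permissions_validity_in_ms: 120000
# Optional: refresh cache entries asynchronously in the background so requests
# don't block on the auth tables once an entry has been loaded
roles_update_interval_in_ms: 30000
permissions_update_interval_in_ms: 30000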

Related

ElasticSearch No Available connections error in Logstash

I'm running a Logstash instance which is connected to an ES cluster behind a load balancer.
The load balancer has an idle timeout of 5 minutes.
Logstash is configured with the ES url corresponding to the loadbalancer ip.
Normally everything works fine, but after a period of request inactivity, the next request processed by LS fails with the following:
[2018-10-30T08:15:00,757][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://10.100.24.254:9200/, :error_message=>"Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2018-10-30T08:15:00,759][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2018-10-30T08:15:02,760][WARN ][logstash.outputs.elasticsearch] UNEXPECTED POOL ERROR {:e=>#<LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError: No Available connections>}
[2018-10-30T08:15:02,760][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
[2018-10-30T08:15:05,651][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://10.100.24.254:9200/, :path=>"/"}
LS eventually recovers, but it takes more than a minute, and this is not acceptable for our SLA.
I suspect that's due to the load balancer closing the connections after 5 minutes of inactivity.
I've tried setting:
timeout => 3
which makes things better. The request is retried after 3 secs, but this is still not good enough.
What's the best set of configuration options that I can use to make sure the connections are always healthy and working before the requests are attempted and so I experience no delay at all?
Try the validate_after_inactivity setting, as described in the elasticsearch output plugin documentation.
Or you can try enabling TCP keep-alive on your Logstash server, so Logstash notices that the connection has been severed when the load balancer hits its idle timeout and starts a new connection instead of sending requests over the old, stale one.
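A minimal sketch of the output block, assuming the elasticsearch output plugin and the load balancer address from the logs above (values are illustrative):
output {
  elasticsearch {
    hosts => ["http://10.100.24.254:9200"]
    # Re-validate a pooled connection that has been idle for this many
    # milliseconds before reusing it; keep it well below the load balancer's
    # 5-minute idle timeout so stale connections are detected up front.
    validate_after_inactivity => 10000
    # Request timeout in seconds, as already tried above
    timeout => 3
  }
}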

Request timed out is not logged on the server side in Cassandra

I have set the server timeout in Cassandra to 60 seconds and the client timeout in the cpp driver to 120 seconds.
I use a batch query containing 18K operations. I get the Request timed out error in the cpp driver logs, but there is no TRACE in the Cassandra server logs despite enabling ALL logging in Cassandra's logback.xml.
So how can I confirm whether the timeout is thrown from the server side or the client side?
BATCH is not intended to work that way. It's designed to apply 6 or 7 mutations to different tables atomically. You're trying to use it like its RDBMS counterpart (Cassandra just doesn't work that way). The BATCH timeout is designed to protect the node/cluster from crashing due to how expensive that query is for the coordinator.
In the system.log you should see warnings/failures concerning the sheer size of your BATCH. If you've raised those thresholds and don't see them, you should see a warning about a timeout threshold being exceeded (I think BATCH gets its own timeout in 3.0).
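For reference, the batch-size thresholds live in cassandra.yaml (the defaults shown below are for 3.0; raising them is rarely the right fix):
batch_size_warn_threshold_in_kb: 5     # log a warning above this size
batch_size_fail_threshold_in_kb: 50    # reject the batch above this size
unlogged_batch_across_partitions_warn_threshold: 10   # warn when an unlogged batch spans many partitions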
If all else fails, run your BATCH statement (part of it) in cqlsh with tracing on, and you’ll see precisely why this is a bad idea (server side).
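In cqlsh that looks roughly like this (keyspace, table and values are hypothetical); the trace output lists every internal step and which replicas were involved:
TRACING ON;
BEGIN BATCH
  INSERT INTO my_ks.my_table (id, val) VALUES (1, 'a');
  INSERT INTO my_ks.my_table (id, val) VALUES (2, 'b');
APPLY BATCH;
TRACING OFF;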
Also, the default query timeouts are there to protect your cluster. You really shouldn’t need to alter those. You should change your query/model or approach before looking at adjusting the timeout.

Difference between the timeout settings on the client and server side

As per this thread: Cassandra read_request_timeout_in_ms set up for external(Client) request, I understand that setting the timeout on the server side alone is not enough; we need to set it on the client side too.
What is the difference between setting timeout in client and server side?
Example :
Setting the request timeout on the server side in Cassandra (cassandra.yaml)
vs
Setting the request timeout on the client side in the Cassandra driver
EDIT:
driver read timeout: the driver did not receive any response from the current coordinator within SocketOptions.setReadTimeoutMillis. It invokes onRequestError on the retry policy with an OperationTimedOutException to decide what to do.
server read timeout: the driver did receive a response, but that response indicates that the coordinator timed out while waiting for other replicas. It invokes onReadTimeout on the retry policy to decide what to do.
Could somebody clearly explain the purpose of each and the difference between the two?
Setting the timeout on the server side, i.e. in cassandra.yaml, is not the same as setting the driver (client-side) timeout using SocketOptions.setReadTimeoutMillis. They work independently; one does not override the other. In general, you should set the driver timeout slightly larger than the server-side timeout.
If a Cassandra node is reachable and working but is not able to respond within the read timeout set in cassandra.yaml, it throws an exception and the driver receives that same exception. The driver might retry if configured to.
If a Cassandra node does not respond at all for some reason, the driver can't wait indefinitely; that's when the driver timeout kicks in and the driver throws an exception because Cassandra is not responding.
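As a rough sketch of how the two relate (values are illustrative; the server-side settings are the standard cassandra.yaml ones):
# cassandra.yaml (server side): how long the coordinator waits on replicas
read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
# The client-side timeout (e.g. SocketOptions.setReadTimeoutMillis in the Java
# driver, default 12000 ms) should stay slightly larger than these values, so
# the coordinator has a chance to answer, or to report its own timeout, before
# the driver gives up and raises OperationTimedOutException.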

Cassandra nodejs driver times out after a node moves

We use vnodes on our cluster.
I noticed that when the token space of a node changes (which happens automatically with vnodes, during a repair or a cleanup after adding new nodes), the datastax nodejs driver gets a lot of "Operation timed out - received only X responses" errors for a few minutes.
I tried using ONE and LOCAL_QUORUM consistencies.
I suppose this is due to the coordinator not hitting the right node just after the move. This seems to be a logical behavior (data was moved) but we really want to address this particular issue.
What do you suggest we do to avoid this? A custom retry policy? Caching? Changing the consistency?
Example of behavior
when we see this:
4/7/2016, 10:43am Info Host 172.31.34.155 moved from '8185241953623605265' to '-1108852503760494577'
We see a spike of those:
{
"message":"Operation timed out - received only 0 responses.",
"info":"Represents an error message from the server",
"code":4608,
"consistencies":1,
"received":0,
"blockFor":1,
"isDataPresent":0,
"coordinator":"172.31.34.155:9042",
"query":"SELECT foo FROM foo_bar LIMIT 10"
}
You suppose this is due to the coordinator not hitting the right node just after the move. In fact, when adding a new node there is token range movement, but Cassandra can still serve read requests using the old token ranges until the scale-out has finished completely. So the behavior you're facing is very suspicious.
If you can reproduce this error, please activate query tracing to narrow down the issue.
The error can also be related to a node being under heavy load and not replying fast enough.
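A couple of quick checks that can help narrow it down (illustrative; run them while the errors are occurring):
nodetool netstats   # shows whether streaming / token-range movement is still in progress
nodetool status     # nodes still joining or moving show up as UJ / UM rather than UN
nodetool tpstats    # pending or dropped READs point to plain overload instead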

Cassandra Nodes Going Down

I have a 3-node Cassandra cluster (replication factor 2) with Solr installed; each node runs RHEL with 32 GB RAM, a 1 TB HDD, and DSE 4.8.3. There are lots of writes happening on my nodes, and my web application also reads from them.
I have observed that all the nodes go down every 3-4 days. I have to restart every node, and then they function quite well until the next 3-4 days, when the same problem repeats. I checked the server logs, but they do not show any error even when the server goes down. I am unable to figure out why this is happening.
In my application, sometimes when I connect to the nodes through the C# Cassandra driver, I get the following error
Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: 'node-ip':9042) at Cassandra.Tasks.TaskHelper.WaitToComplete(Task task, Int32 timeout) at Cassandra.Tasks.TaskHelper.WaitToComplete[T](Task`1 task, Int32 timeout) at Cassandra.ControlConnection.Init() at Cassandra.Cluster.Init()
But when I check OpsCenter, none of the nodes are down; all node statuses look perfectly fine. Could this be a problem with the driver? Earlier I was using Cassandra C# driver version 2.5.0 installed from NuGet, but I have since updated to version 3.0.3 and the error still persists.
Any help on this would be appreciated. Thanks in advance.
If you haven't done so already, you may want to look at raising your logging level to DEBUG by running nodetool -h 192.168.XXX.XXX setlogginglevel org.apache.cassandra DEBUG on all your nodes.
Your first issue is most likely an OutOfMemory Exception.
For your second issue, the problem is most likely that you have really long GC pauses. Tailing /var/log/cassandra/debug.log or /var/log/cassandra/system.log may give you a hint but typically doesn't reveal the problem unless you are meticulously looking at the timestamps. The best way to troubleshoot this is to ensure you have GC logging enabled in your jvm.options config and then tail your gc logs taking note of the pause times:
grep 'Total time for which application threads were stopped:' /var/log/cassandra/gc.log.1 | less
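If GC logging isn't on yet, flags along these lines enable it (illustrative; depending on your version they go in jvm.options or cassandra-env.sh):
-Xloggc:/var/log/cassandra/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M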
The "Unexpected exception during request; channel = [....] java.io.IOException: Error while read (....): Connection reset by peer" error typically indicates inter-node timeouts, i.e. the coordinator times out waiting for a response from another node and sends a TCP RST packet to close the connection.
