Retrying SQLSTATE XX000 in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
Just got this error message while running my query:
Network error: recvmsg got EOF from remote (SQLSTATE XX000)
Is this error retryable?

You should treat this as a timeout error against any DB. The app should retry this type of error, but it is possible that the previous operation already succeeded, in which case you could get, for example, a duplicate primary key error (if you are retrying an INSERT).
In this particular case, our expectation is that most likely there was a yb-tserver restart, a connection failure, or a network partition of some kind.
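
Since YSQL is PostgreSQL-compatible, a client-side retry loop can be written with any Postgres driver. Below is a minimal sketch in Python with psycopg2 (connection arguments, retry counts, and backoff values are illustrative, not prescribed by YugabyteDB); note how a retried INSERT has to tolerate a duplicate-key error, because the first attempt may already have been applied before the connection dropped:

import time
import psycopg2
from psycopg2 import errors

def execute_with_retry(conn_args, sql, params, max_attempts=3):
    # Retry on dropped connections and internal (XX000) errors. The first
    # attempt may already have been applied, so a unique violation on a
    # retried INSERT is treated as success here.
    for attempt in range(1, max_attempts + 1):
        conn = None
        try:
            conn = psycopg2.connect(**conn_args)
            with conn:                      # commits on success, rolls back on error
                with conn.cursor() as cur:
                    cur.execute(sql, params)
            return
        except errors.UniqueViolation:
            return                          # a previous attempt most likely succeeded
        except (psycopg2.OperationalError, psycopg2.InternalError):
            if attempt == max_attempts:
                raise
            time.sleep(0.5 * attempt)       # simple backoff before reconnecting
        finally:
            if conn is not None:
                conn.close()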

Related

Getting "Already present: Duplicate request" errors in YugabyteDB YSQL

[Question posted by a user on YugabyteDB Community Slack]
Sending a lot of concurrent requests to the YSQL layer, we're getting Already present: Duplicate request, XX000 errors. Are these safe to retry?
Here's an in-progress list of the retryable cases:
If you get Duplicate Request in the middle of a transaction, it can safely be retried.
If Duplicate Request was generated by a standalone statement, it is possible that the original statement was executed and applied to the database, so whether a retry is safe depends on the query and the application.
Catalog Version Mismatch — this is really a txn conflict but specifically with a DDL. There’s already an issue to change the error code for this: https://github.com/yugabyte/yugabyte-db/issues/8597.
Any error with a Try Again prefix — but these might already be re-mapped/retried internally.
Leader not ready to serve requests. and Leader does not have a valid lease. are also retryable.
TimedOut requests are also retryable in some cases (e.g. for pure reads and operations in transaction blocks), but not safe for single-row writes.
Generally, for non-XX000 error codes the same rules apply as for vanilla Postgres (e.g. 40001 is retryable — and we should already map YB transaction errors and read restart required errors to that error code).
For XX000 (internal error), there are a number of specific errors that should be safe to retry.
Internally we already re-map some of the errors to YSQL/PG error codes (like mentioned above) and generally we aim to do that appropriately.
The full list of internal error codes is at: https://github.com/yugabyte/yugabyte-db/blob/master/src/yb/util/status.h#L149
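
Putting the rules above together, here is a sketch of a client-side classifier (Python with psycopg2; the message prefixes are plain string matches against the cases listed in this answer, not an official API, and the helper name is made up):

RETRYABLE_XX000_MESSAGES = (
    "Duplicate request",                   # safe mid-transaction; idempotency caveat otherwise
    "Catalog Version Mismatch",            # really a txn conflict with a DDL
    "Try Again",                           # may already be retried internally
    "Leader not ready to serve requests",
    "Leader does not have a valid lease",
)

def is_retryable(exc, in_transaction_block=False):
    # Heuristic based on the list above; adapt it to your own workload.
    sqlstate = getattr(exc, "pgcode", None)
    message = str(exc)
    if sqlstate == "40001":                # serialization failure, same as vanilla Postgres
        return True
    if sqlstate == "XX000":
        if any(prefix in message for prefix in RETRYABLE_XX000_MESSAGES):
            return True
        if "Timed" in message:
            # Retryable for pure reads and statements inside a transaction
            # block, but not for standalone single-row writes.
            return in_transaction_block
    return False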

Why connection is timing out frequently

As a follow-up to this question: Enable one time Cassandra Authentication and Authorization check and cache it forever
I would like to understand why I get a Request timed out error. In the server logs I see only the following:
ERROR [SharedPool-Worker-34] 2018-06-01 10:40:36,589 ErrorMessage.java:338 - Unexpected exception during request
java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:489) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:269) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.RolesCache.getRoles(RolesCache.java:66) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:51) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) ~[apache-cassandra-3.0.8.jar:3.0.8]
I understand that I didn't enable the caching of authentication and authorization in cassandra.yaml, but could somebody explain why I get this error so frequently? Is it a costly operation, performance-wise, in Cassandra?
If you're using the default Cassandra user, the auth lookup is a normal query at QUORUM; any other user uses LOCAL_ONE. So in terms of "operation cost" there is nothing abnormal. But the error message (this part specifically: "Operation timed out - received only 0 responses.") suggests you have overloaded nodes that can't respond to your queries.
A quick look at your nodes using nodetool tpstats would show whether you're having problems serving your reads (look for blocked, all-time blocked, and/or dropped reads).
Auth queries are done for every query you run (AFAIK), so you should enable the caches for them (and avoid overloading your cluster).
Relevant documentation: https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/secureConfigNativeAuth.html
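
For reference, the auth caches are configured in cassandra.yaml; an illustrative snippet (option names are from Cassandra 3.x, and the validity values here are arbitrary, so check your version's documentation before copying):

authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
roles_validity_in_ms: 60000         # cache role lookups for 60 seconds
permissions_validity_in_ms: 60000   # cache permission checks for 60 seconds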

Sporadic Cassandra WriteErrors when using Lightweight Transactions

I have a service that connects to our Cassandra cluster and executes tens of thousands of queries per day using Lightweight (ACID) Transactions to implement the consensus system described here. For the most part it works fine, but sporadically the writes fail with an error saying "Operation timed out - received only 1 responses" (or, less commonly, only 0 responses). We're using the DataStax Python driver. When the error occurs, the full error line (at the end of the stack trace) reads:
WriteTimeout: Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 1 responses." info={'received_responses': 1, 'required_responses': 2, 'consistency': 'LOCAL_SERIAL'}
Is this something that seems expected to occur from time to time in a production Cassandra setup? Or does it seem like something where we could have a configuration problem with our Cassandra cluster or network?
Some information about our Cassandra cluster: It is an 8-node setup spread across 2 Amazon EC2 regions (4 nodes per region). All of the nodes are running version 3.3.0 of the Datastax Cassandra distribution.
From https://issues.apache.org/jira/browse/CASSANDRA-9328:
There are cases where, under contention, the coordinator loses track of whether the value it submitted to Paxos might be applied or not (see CASSANDRA-6013). At that point we can't do anything other than answer "sorry, I don't know". And since a WriteTimeoutException already means "I don't know", we throw it in that case too, even though it's not a proper timeout per se.
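
In other words, a WriteTimeout on an LWT means the outcome is unknown, not that the write failed. One way to handle it is to read the row back at SERIAL consistency to learn whether the conditional write actually applied before deciding to retry. A sketch with the DataStax Python driver (keyspace, table, and column names are invented for illustration):

from cassandra import ConsistencyLevel, WriteTimeout
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])            # illustrative contact point
session = cluster.connect("my_keyspace")   # illustrative keyspace

cas = session.prepare("UPDATE leases SET owner = ? WHERE resource = ? IF owner = ?")
check = session.prepare("SELECT owner FROM leases WHERE resource = ?")
check.consistency_level = ConsistencyLevel.SERIAL   # linearizable read of the Paxos state

def acquire(resource, new_owner, expected_owner, attempts=3):
    for _ in range(attempts):
        try:
            return session.execute(cas, (new_owner, resource, expected_owner)).was_applied
        except WriteTimeout:
            # The coordinator does not know whether the Paxos round applied.
            # A SERIAL read tells us whether our value is already there.
            row = session.execute(check, (resource,)).one()
            if row is not None and row.owner == new_owner:
                return True                  # the timed-out attempt did apply
            # otherwise retry the conditional write
    raise RuntimeError("could not determine LWT outcome")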

"ERROR while connecting to database. Error: Error: No valid replicaset instance servers found"

I'm using a replica set with 2 nodes (primary and secondary) and 1 arbiter (3 members total).
Sometimes I get "ERROR while connecting to database. Error: Error: No valid replicaset instance servers found". I'm not able to reproduce it (it happens on its own, sometimes very frequently). I've added a server.on('error', ...) handler to debug, and sometimes in my local environment it prints a connection error naming one of the member hosts (though I don't know whether that is related to my problem).
When I connect to one of the instances through the mongo shell and check rs.status(), everything looks fine, with all members healthy and up.
Jira link for above question is:
https://jira.mongodb.org/browse/NODE-296
An arbiter is a voting-only member that takes part in electing a new primary when the current primary goes down. Add an arbiter so that the total number of voting members is odd: with only 2 data-bearing nodes, the arbiter supplies the tie-breaking vote, so when one node is down the other can still become primary. Also consider adding a few more data-bearing nodes.
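
Independently of the topology advice, listing several replica-set members in the connection string gives the driver more chances to discover whichever node is currently primary. A sketch with PyMongo (host names, replica-set name, and timeout are illustrative; the Node driver accepts the same URI format):

from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError

# List more than one member so the driver can still discover the primary
# when a single seed host is unreachable.
client = MongoClient(
    "mongodb://node1.example.com:27017,node2.example.com:27017/?replicaSet=rs0",
    serverSelectionTimeoutMS=5000,
)

try:
    client.admin.command("ping")
except ServerSelectionTimeoutError:
    # No reachable primary within the timeout, e.g. during an election;
    # retry after a short delay instead of failing hard.
    pass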

ProviderIncompatibleException was unhandled by user code

I have just got the latest code from SVN and I got the above error when I logged into my application. The exception message was:
An error occurred while getting provider information from the
database. This can be caused by Entity Framework using an incorrect
connection string. Check the inner exceptions for details and ensure
that the connection string is correct.
The inner exception says:
The client was unable to establish a connection because of an error
during connection initialization process before login. Possible causes
include the following: the client tried to connect to an unsupported
version of SQL Server; the server was too busy to accept new
connections; or there was a resource limitation (insufficient memory
or maximum allowed connections) on the server. (provider: Shared
Memory Provider, error: 0 - The handle is invalid.)
The issue is, none of these suggestions seem like the cause. Any idea what might cause this?
You're going to love the solution to this. I restarted my machine and it works fine now. :o).
