Updated cluster config does not contain myself. Rejecting component=arangodb - arangodb

My ArangoDB cluster is struggling to stay up. A node will periodically go red and show messages like this in the logs:
2023-01-11T21:23:04Z |WARN| Updated cluster config does not contain myself. Rejecting component=arangodb
2023-01-11T21:23:14Z |WARN| Updated cluster config does not contain myself. Rejecting component=arangodb
2023-01-11T21:23:24Z |WARN| Updated cluster config does not contain myself. Rejecting component=arangodb
2023-01-11T21:23:34Z |WARN| Updated cluster config does not contain myself. Rejecting component=arangodb
2023-01-11T21:23:44Z |WARN| Updated cluster config does not contain myself. Rejecting component=arangodb
I haven't been able to find any info on what causes this. Network connectivity seems good (all the nodes can reach each other).
This is a 9-node ArangoDB cluster (version 3.10.1) with 3 agents, 9 dbservers and 9 coordinators, started via the arangodb launcher.
I have tried trace and debug logging, hoping to get a better idea of what is going on, but no luck.

Related

Vertx clustered eventbus not removing old node on kubernetes rolling deployment

I have two Vert.x microservices running in a cluster that communicate with each other through a headless service in an on-premises cloud. Whenever I do a rolling deployment, I face connectivity issues between the services. When I analysed the logs, I could see that the old node/pod is removed from the cluster member list, but the event bus does not remove it and keeps using it on a round-robin basis.
Below is the member group information before the deployment:
Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80 //pod 1
Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447 //pod 2
When the deployment starts, pod 2 gets removed from the member list:
[192.168.4.54]:5701 [dev] [4.0.2] Could not connect to: /192.168.101.79:5701. Reason: SocketException[Connection refused to address /192.168.101.79:5701]
Removing connection to endpoint [192.168.101.79]:5701 Cause => java.net.SocketException {Connection refused to address /192.168.101.79:5701}, Error-Count: 5
Removing Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447
And a new member is added:
Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80
Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
Member [192.168.94.85]:5701 - 1347e755-1b55-45a3-bb9c-70e07a29d55b //new pod
All migration tasks have been completed. (repartitionTime=Mon May 10 08:54:19 MST 2021, plannedMigrations=358, completedMigrations=358, remainingMigrations=0, totalCompletedMigrations=3348, elapsedMigrationTime=1948ms, totalElapsedMigrationTime=27796ms)
But when a request is made to the deployed service, even though the old pod has been removed from the member group, the event bus is still using the old pod/service reference (ac0dcea9-898a-4818-b7e2-e9f8aaefb447):
[vert.x-eventloop-thread-1] DEBUG io.vertx.core.eventbus.impl.clustered.ConnectionHolder - tx.id=f9f5cfc9-8ad8-4eb1-b12c-322feb0d1acd Not connected to server ac0dcea9-898a-4818-b7e2-e9f8aaefb447 - starting queuing
I checked the official documentation for rolling deployments, and my deployment seems to follow the two key rules it mentions (only one pod is removed, and then the new one is added):
never start more than one new pod at once
forbid more than one unavailable pod during the process
I am using Vert.x 4.0.3 and hazelcast-kubernetes 1.2.2. My verticle class extends AbstractVerticle and is deployed using:
Vertx.clusteredVertx(options, vertx -> {
    vertx.result().deployVerticle(verticleName, deploymentOptions);
});
Sorry for the long post, any help is highly appreciated.
One possible cause is a race condition between Kubernetes removing the pod and updating the endpoints in kube-proxy, as detailed in this extensive article. Because of this race, Kubernetes can continue to send traffic to the pod being removed after it has terminated.
One TL;DR solution is to add a delay when terminating a pod by either:
Having the service delay when it receives a SIGTERM (e.g. for 15 seconds), so that it keeps responding to requests as normal during that period.
Using the Kubernetes preStop hook to execute a sleep 15 command in the container. This allows the service to continue responding to requests during that 15-second period while Kubernetes is updating its endpoints. Kubernetes will send SIGTERM when the preStop hook completes.
Both solutions give Kubernetes some time to propagate changes to its internal components so that traffic stops being routed to the pod being removed.
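The first option (delaying inside the service after SIGTERM) could be sketched in plain Java roughly as follows. This is only an illustration under assumptions: GracefulShutdown, isAccepting and onTerminate are hypothetical names, not part of Vert.x or Hazelcast, and the actual cleanup (closing servers, leaving the cluster) is application-specific and left out.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: keeps the service "accepting" for a grace period after SIGTERM.
class GracefulShutdown {
    private final AtomicBoolean accepting = new AtomicBoolean(true);
    private final Duration gracePeriod;

    GracefulShutdown(Duration gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    // Request handlers can consult this flag; it stays true during the grace period.
    boolean isAccepting() {
        return accepting.get();
    }

    // Waits for the grace period first, then marks the service as stopped.
    void onTerminate() {
        try {
            Thread.sleep(gracePeriod.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        accepting.set(false);
        // Application-specific cleanup would go here (assumption: omitted in this sketch).
    }

    // Registers the delay as a JVM shutdown hook; the JVM runs its hooks when it receives SIGTERM.
    void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::onTerminate, "graceful-shutdown"));
    }
}
```

It would be installed once at startup, for example with new GracefulShutdown(Duration.ofSeconds(15)).install(). Two caveats: JVM shutdown hooks run concurrently and in no guaranteed order, so if another hook closes the server right away this delay has no effect (the preStop route avoids that), and the delay has to fit inside the pod's terminationGracePeriodSeconds, which defaults to 30 seconds.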
A caveat to this answer is that I'm not familiar with Hazelcast clustering or how your specific discovery mode is set up.

com.datastax.driver.core.Metadata:getHosts() returning incorrect state

Is there any reason why com.datastax.driver.core.Metadata:getHosts() would return state UP for a host that has shut down?
Meanwhile, nodetool status returns DN for that host.
No matter how many times I check Host.getState(), it still says UP for that dead host.
This is how I'm querying Metadata:
cluster = DseCluster.builder()
    .addContactPoints("192.168.1.1", "192.168.1.2", "192.168.1.3")
    .withPort(9042)
    .withReconnectionPolicy(new ConstantReconnectionPolicy(2000))
    .build();
cluster.getMetadata().getAllHosts();
EDIT: Updated the code to reflect that I'm trying to connect to 3 hosts. I should have stated that the cluster I'm connecting to has 3 nodes, 2 in DC1 and another in DC2.
Also, whenever I relaunch my Java process running this code, the behavior changes. Sometimes it gives me the right states, then when I restart it again, it gives me the wrong states, and so on.
I will post an answer which I got from the datastaxacademy slack:
Host.getState() is the driver's view of what it thinks the host state is, whereas nodetool status is what that C* node thinks the state of all nodes in the cluster is from its view (propagated via gossip). There is not a way to get that via the driver.
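Since the driver's view and the gossip view can disagree, one rough cross-check from the client side is to test whether the node's native port accepts TCP connections at all. The sketch below uses only the Java standard library; HostProbe and isReachable are hypothetical names, and a successful connection only shows that something is listening on the port, not what gossip reports for the node.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, independent of the DataStax driver.
class HostProbe {
    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unresolved host: treat all as unreachable.
            return false;
        }
    }
}
```

For the contact points in the question this would be called as HostProbe.isReachable("192.168.1.1", 9042, 2000), once per host.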

Cassandra Nodes Going Down

I have a 3-node Cassandra cluster (replication factor set to 2) with Solr installed, each node having RHEL, 32 GB RAM, a 1 TB HDD and DSE 4.8.3. There are lots of writes happening on my nodes, and my web application also reads from them.
I have observed that all the nodes go down every 3-4 days. I have to restart every node, and then they function quite well for the next 3-4 days until the same problem repeats. I checked the server logs, but they do not show any error even when the server goes down. I am unable to figure out why this is happening.
In my application, sometimes when I connect to the nodes through the C# Cassandra driver, I get the following error
Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: 'node-ip':9042)
   at Cassandra.Tasks.TaskHelper.WaitToComplete(Task task, Int32 timeout)
   at Cassandra.Tasks.TaskHelper.WaitToComplete[T](Task`1 task, Int32 timeout)
   at Cassandra.ControlConnection.Init()
   at Cassandra.Cluster.Init()
But when I check OpsCenter, none of the nodes are down; every node's status looks perfectly fine. Could this be a problem with the driver? Earlier I was using Cassandra C# driver version 2.5.0 installed from NuGet. I have since updated it to version 3.0.3, but the error still persists.
Any help on this would be appreciated. Thanks in advance.
If you haven't done so already, you may want to set your logging level to DEBUG by running: nodetool -h 192.168.XXX.XXX setlogginglevel org.apache.cassandra DEBUG on all your nodes
Your first issue is most likely an OutOfMemory Exception.
For your second issue, the problem is most likely that you have really long GC pauses. Tailing /var/log/cassandra/debug.log or /var/log/cassandra/system.log may give you a hint but typically doesn't reveal the problem unless you are meticulously looking at the timestamps. The best way to troubleshoot this is to ensure you have GC logging enabled in your jvm.options config and then tail your gc logs taking note of the pause times:
grep 'Total time for which application threads were stopped:' /var/log/cassandra/gc.log.1 | less
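If scanning the grep output by eye is tedious, the same filter can be sketched in Java so that only pauses above a threshold are kept. This is a minimal sketch with assumptions: GcPauseScanner and longPauses are hypothetical names, the threshold is arbitrary, and the pattern assumes the "Total time for which application threads were stopped: N seconds" format that the grep above targets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for picking long stop-the-world pauses out of a GC log.
class GcPauseScanner {
    // Captures the pause length; accepts '.' or ',' as the decimal separator.
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9]+[.,][0-9]+) seconds");

    // Returns the log lines whose pause is at least thresholdSeconds long.
    static List<String> longPauses(List<String> lines, double thresholdSeconds) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                double seconds = Double.parseDouble(m.group(1).replace(',', '.'));
                if (seconds >= thresholdSeconds) {
                    result.add(line);
                }
            }
        }
        return result;
    }
}
```

A possible call would be GcPauseScanner.longPauses(Files.readAllLines(Paths.get("/var/log/cassandra/gc.log.1")), 0.5), where the 0.5 second threshold is only an example value. The timestamps on the returned lines can then be compared with the times at which the client saw NoHostAvailableException.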
The "Unexpected exception during request; channel = [....] java.io.IOException: Error while read (....): Connection reset by peer" error is typically caused by inter-node timeouts, i.e. the coordinator times out waiting for a response from another node and sends a TCP RST packet to close the connection.

How to recover Cassandra node from failed bootstrap

A node went down while I was bootstrapping a new node, and the bootstrap failed. The new node shut down, leaving the following messages in its log:
INFO [main] 2015-02-07 06:03:32,761 StorageService.java:1025 - JOINING: Starting to bootstrap...
ERROR [main] 2015-02-07 06:03:32,799 CassandraDaemon.java:465 - Exception encountered during startup
java.lang.RuntimeException: A node required to move the data consistently is down (/10.0.3.56). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
How do I recover the situation? Can I restart the bootstrap process once the failed node is back online? Or do I need to revert the partial bootstrap and try again somehow?
I have tracked down the original cause. The new node was able to connect to the node at 10.0.3.56, but 10.0.3.56 was not able to open connections back to the new node. 10.0.3.56 contained the only copy of some data that needed to be moved to the new node (replication factor == 1), but its attempts to send the data were blocked.
Since this involves moving data, not just replication, and based on the place in the code where the exception is thrown, I assume you are trying to replace a dead node as described here: http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
By the look of it, the node did not get as far as joining the ring. You can double-check whether it has joined at all by running nodetool status.
If it has not, you can simply delete everything from the data, commitlog and saved_caches directories and restart the process. What was wrong with that 10.0.3.56 node?
If the node has joined the ring, it should still be safe to simply restart it once node 10.0.3.56 is back up.

Refresh metadata of cassandra cluster

I added nodes to a cluster which initially used the wrong network interface as listen_address. I fixed it by changing listen_address to the correct IP. The cluster is running well with that configuration, but clients trying to connect to the cluster still receive the wrong IPs as metadata from it. Is there any way to refresh the metadata of a cluster without decommissioning the nodes and setting up new ones again?
First of all, you may try to follow this advice: http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_gossip_purge.html
You will need to restart the entire cluster on a rolling basis, one node at a time.
If this does not work, try this on each node:
USE system;
SELECT * FROM peers;
Then delete the bad records from the peers table and restart the node. Move on to the next node and repeat.

Resources