Some Cassandra nodes show DN - gossip information is not unanimous - cassandra

I have a 16-node Cassandra cluster (3.11.9) with 3 seed nodes (.54, .115 and .164), replication factor 3 and gc_grace_seconds of 10 days (the default). Some nodes show as DN on some nodes but as UN on others. For example, below is the nodetool status from the .54 and the .115 nodes:
[nodetool status output from .54 and .115 not preserved]
while on .87, for example, every node is UN. This has been happening for at least a couple of weeks. It started with two nodes, .54 and .147, showing each other as down, but it seems to have spread: more and more nodes now show as DN on some nodes (but not on all). I should add that there were no writes during these weeks.
I have tried disabling and re-enabling gossip and restarting Cassandra on all nodes. The generation stamp is up to date in the system_auth table. I can connect to these nodes with cqlsh but, as expected, in some cases I get NoHostAvailable because some of the data is located on the "dead" nodes.
nodetool describecluster shows the DN nodes as Unreachable, depending on which node I run it on. For example, .54 shows .164, .115, .147 and .19 as Unreachable.
Also, in nodetool gossipinfo everything looks OK, with status: normal and an up-to-date generation.
In the debug.log file I only get:
DEBUG [MessagingService-Outgoing-/192.168.100.147-Gossip] 2022-05-30 03:58:43,478 OutboundTcpConnection.java:546 - Unable to connect to /192.168.100.147
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_312]
at sun.nio.ch.Net.connect(Net.java:482) ~[na:1.8.0_312]
at sun.nio.ch.Net.connect(Net.java:474) ~[na:1.8.0_312]
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:647) ~[na:1.8.0_312]
at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:146) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:132) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:434) [apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:262) [apache-cassandra-3.11.9.jar:3.11.9]
In system.log, all the nodes are actually logged as "Node...has restarted, now UP" and "Node...state jump to Normal". However, I also noticed this, which may be unrelated:
WARN [GossipStage:1] 2022-05-30 07:22:06,164 Gossiper.java:1693 - Received an ack from /192.168.100.127, who isn't a seed. Ensure your seed list includes a live node. Exiting shadow round
Is there any way to understand what is happening and why? Am I missing something?
Please let me know if you need any more information.

This looks like a classic networking issue to me where the nodes are unable to gossip with each other because there's no connectivity on the internode port (default is 7000). The debug message you posted clearly states the cause:
java.net.NoRouteToHostException: No route to host
You need to check that there are no firewalls like iptables or firewalld blocking the traffic on port 7000, otherwise the nodes can't talk to each other.
It is simple enough to test it using Linux tools such as telnet or nc. For example, run this command on node .54:
$ telnet 192.168.100.115 7000
If the connection fails (for example with "connection refused" or "no route to host"), it means that one of the following is true:
there's no network route to the node,
the traffic to the default gossip port 7000 is blocked, or
gossip is configured on another port (check storage_port in cassandra.yaml).
But in my experience, the most likely cause is that traffic is blocked by a firewall. Cheers!
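If telnet or nc is not installed on the nodes, the same check can be sketched in Python. This is only a minimal sketch: the function name can_connect is made up for illustration, and port 7000 is assumed to be your storage_port.

```python
import socket

def can_connect(host, port=7000, timeout=3.0):
    """Try a plain TCP connection; return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # OSError covers refused connections, timeouts and "No route to host".
        return False, f"{type(exc).__name__}: {exc}"
```

For example, running can_connect("192.168.100.115") on node .54 and can_connect("192.168.100.54") on node .115 tests the path in both directions, which matters because gossip needs both.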

Related

Cassandra 3.9, how to remote access [duplicate]

I built Cassandra server 2.0.3 and ran it. It starts and then stops with these messages:
X:\MyProjects\cassandra\apache-cassandra-2.0.3-src\bin>cassandra.bat >log.txt
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1160)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:416)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:608)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:576)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:475)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:346)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
What can I change to get it running?
I had a similar problem with my cassandra v2.0.4 cluster running a single node.
Check your cassandra.yaml and make sure that your "listen_address" and "seeds" values match, with the exception that the seeds value requires quotes around it.
You might get this problem if your private IP address is different than the public one (like on AWS). For example, the host thinks it's "172.31.0.2" when it's visible as "55.70.33.10".
The solution to this problem is to set both addresses in cassandra.yaml:
listen_address: 172.31.0.2
broadcast_address: 55.70.33.10
Make sure the cluster_name entries match on all the nodes in the cluster
(you may need to delete your storage if you changed the cluster name)
Verify that all nodes can ping each other
broadcast_rpc_address and listen_address should be set to local IP
(not localhost or 127.0.0.1)
seeds should point to the IP address of the seed(s)
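The address items in this checklist can be sanity-checked with a few lines of Python. This is a rough sketch for a multi-node setup only: address_problems is a hypothetical helper, and it takes the listen_address and seeds values as plain strings rather than reading cassandra.yaml itself.

```python
import ipaddress

def address_problems(listen_address, seeds):
    """Return warnings for address settings that contradict the checklist."""
    problems = []
    if listen_address == "localhost":
        problems.append("listen_address is localhost")
    else:
        try:
            if ipaddress.ip_address(listen_address).is_loopback:
                problems.append("listen_address is a loopback address")
        except ValueError:
            pass  # a hostname; not flagged by this sketch
    # seeds is a comma-delimited string, as in cassandra.yaml
    for seed in (s.strip() for s in seeds.split(",")):
        try:
            ipaddress.ip_address(seed)
        except ValueError:
            problems.append(f"seed {seed!r} is not an IP address")
    return problems
```

An empty list only means none of these particular mistakes were found; it says nothing about firewalls or routing.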
If you are on AWS and use the Ec2MultiRegionSnitch you will need to set the seeds to the public IP addresses rather than the private IPs.
I had the same problem on Ubuntu 16.04. I'm not sure which of these changes made it work. Below are selections from my cassandra.yaml, where XXX.XXX.XXX.XXX is your public-facing IP address:
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "XXX.XXX.XXX.XXX"
listen_address: XXX.XXX.XXX.XXX
broadcast_address: XXX.XXX.XXX.XXX
broadcast_rpc_address: XXX.XXX.XXX.XXX
listen_on_broadcast_address: true
start_rpc: true
rpc_address: XXX.XXX.XXX.XXX
I also needed to restart my Virtual Machine for some reason. ¯_(ツ)_/¯
For a quick single node setup on RHEL, I did the following:
Get info about your network interface setup:
# /sbin/ifconfig -a
It will list the interfaces and the ip addresses they are attached to.
Usually it will show an "Ethernet" interface and a "Local Loopback".
Get the associated ip addresses.
Then edit conf/cassandra.yaml:
rpc_address: [Local Loopback address]
broadcast_rpc_address: [Ethernet address]
listen_address: [Local Loopback address]
broadcast_address: [Ethernet address]
listen_on_broadcast_address: true
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "[Ethernet address]"
Then also open the required ports on the Linux firewall: 9042, 7000 and 7001. More info about opening ports on Linux here:
http://ask.xmodulo.com/open-port-firewall-centos-rhel.html
In cassandra.yaml, I changed the seed from a domain name to an IP address, and it worked.
This happened to me because the "initial_token" setting was specified in my configuration (I think because I had just copied the configuration file over from another cluster member). After clearing the data directory, commenting out the setting and restarting the node, it worked fine for me.
I experienced this error today...
I could not find any reason for the error other than timing issues.
I restarted many times and after a while it stuck. It looks like Cassandra expects bi-directional communication on the gossip channel, and if that does not happen quickly enough (within what looks like a very short time to me), it drops the connection and generates that error.
In my case I just upgraded my software and restarted the computer. So it was clearly not a connection issue between the computers (I have firewalls and SSL, to complicate matters) and the node was connected before... So the one entry I found in that regard from datastax did not apply...
https://support.datastax.com/hc/en-us/articles/209691483-Bootstap-fails-with-Unable-to-gossip-with-any-seeds-yet-new-node-can-connect-to-seed-nodes
I got the same error. There can be more than one cause; hopefully mine is the same as yours.
I had my localhost IP pointing to a domain name (so that my Spring Boot application's server context would be a domain name like www.example.com:8080 instead of localhost:8080), using the following entry in the hosts file on my Windows system:
127.0.0.1 www.example.com
The Cassandra batch file, however, was looking for localhost, which it could not resolve. So I added an entry for localhost to the hosts file as well:
127.0.0.1 localhost
127.0.0.1 www.example.com
After adding it, I opened new command prompt, ran cassandra batch from the cassandra bin directory and it then worked.
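To spot this situation quickly, the hosts file can be scanned for the names mapped to 127.0.0.1. A small sketch, assuming you pass in the file's text yourself (on Windows the file is typically C:\Windows\System32\drivers\etc\hosts); names_for_loopback is a made-up helper name.

```python
def names_for_loopback(hosts_text):
    """Collect every hostname mapped to 127.0.0.1 in hosts-file text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *hostnames = line.split()
        if address == "127.0.0.1":
            names.update(hostnames)
    return names
```

If "localhost" is missing from the returned set, you may be in the same situation I was.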
Disable the firewall and SELinux and try again.
In our case SSL was enabled, and the cassandra.yaml configuration looked fine as per the comments above. We then enabled SSL debugging by adding the JVM parameter below to cassandra-env.sh: -Djavax.net.debug=ssl:handshake
After starting the node again, we noticed the following in the Cassandra log file:
MessagingService-Outgoing-geo2_host/xx.xx.xx.xx, Exception while
waiting for close javax.net.ssl.SSLHandshakeException: Received fatal
alert: certificate_unknown
After further investigating the SSL debug logs, we found that the certificate was not valid. After fixing this SSL issue, the node was able to join the cluster.
Thanks to elvingt.
His answer reminded me that I need to verify that all nodes are able to talk to each other.
https://support.datastax.com/hc/en-us/articles/209691483-Bootstap-fails-with-Unable-to-gossip-with-any-seeds-yet-new-node-can-connect-to-seed-nodes
Gossip communications must be bi-directional.
To verify, use this command, and test from BOTH sides:
nc -vz {your_node_ip} 7000
Then I remembered that I had turned on my Ubuntu firewall the night before. I opened the port with:
sudo ufw allow 7000/tcp
And it is working now
Getting this error during startup/bootstrap:
Unable to gossip with any seeds
indicates that there is some issue with broadcast_address. broadcast_address is used for communication with other nodes, not with clients.
This address must be set on the seed nodes (it is mandatory for them). If you are using cloud VMs, you might have different IPs (public and private), so it is recommended to use your private IPs for broadcast_address; this will also save on network cost.
# Address to broadcast to other Cassandra nodes
# Leaving this blank will set it to the same value as listen_address
broadcast_address: 10.11.xx.xxx
In my scenario I was using IBM, and once I set broadcast_address on the seed nodes the issue was resolved.
Please make sure you start your seed node first and then the other nodes; this order is mandatory.
In cassandra.yaml, changing the listen_address value from localhost to domainName solved my issue.
I had the same issue. I checked the port and used tcpdump and netcat to test connections, and in the end it came down to expired SSL certificates used for internode_encryption. I set internode_encryption to 'none', restarted all nodes, and it worked.
Before that, all neighbor nodes showed as down, and the repair command was failing with:
"Did not get positive replies from all endpoints"
P.S. Don't leave internode_encryption set to none for long; regenerate the certificates and enable it again.
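To catch an expiring certificate before it takes gossip down, the expiry date can be turned into a number of remaining days. This is a sketch under one assumption: you already have the certificate's notAfter string in the form that openssl x509 -noout -enddate prints (for example "Jan  5 09:34:43 2018 GMT"). The helper name days_until_expiry is made up.

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative if expired)."""
    now = now or datetime.now(timezone.utc)
    # cert_time_to_seconds parses the "%b %d %H:%M:%S %Y GMT" format
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400
```

A negative result means the certificate has already expired, which was the situation on my nodes.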

Joining a node to a cluster

I have tried to do the necessary configuration to deploy multiple instances of Cassandra on 2 different nodes of a multi-node cluster, but the nodes are having trouble seeing each other. Can someone advise me how to join a node to my cluster?
To join a node to a cluster, the following settings need to match in the nodes' cassandra.yaml files:
cluster_name
endpoint_snitch
num_tokens
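A quick way to compare these three settings across nodes is sketched below. It assumes you have already collected each node's values into a dict keyed by node name or IP (how you collect them is up to you); mismatched_settings is a hypothetical helper, not part of any Cassandra tooling.

```python
MUST_MATCH = ("cluster_name", "endpoint_snitch", "num_tokens")

def mismatched_settings(configs):
    """Map each setting that differs between nodes to its {node: value} view."""
    mismatches = {}
    for key in MUST_MATCH:
        values = {node: cfg.get(key) for node, cfg in configs.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches
```

An empty result means the three settings agree on every node you passed in.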
Get your first node running, and make sure the following ports are open on your firewall or internal network:
7000 (gossip)
7001 (if using node-to-node SSL)
7199 (JMX)
9042 (client connections)
On your second node, make sure the first node's IP address is in its seed list. All of your nodes should share the same seed list as well. Depending on the size of your cluster, you should have two or three seeds per data center.
Example:
# seeds is actually a comma-delimited list of addresses.
- seeds: "192.168.0.100,192.168.0.101"
Once your seed nodes are set, fire up your second node and it should join. If it doesn't, check the system.log for errors.

What happens if I don't put all the C* hosts in DevCenter?

Let's say I have 4 nodes: host1, host2, host3 and host4. However I only add host1 and host2 as Contact hosts. What would happen if I perform any operation in DevCenter? Will the action propagate to host3 and host4? Will this cause data corruption?
Here's what will happen:
DevCenter will use the Whitelist load balancing policy to connect to the provided nodes
While DevCenter uses the DataStax Java driver as the underlying connector, it does use the above mentioned load balancing policy to reduce the time needed to obtain connections (instead of the default driver's load balancing policy which requires discovering all the nodes in the cluster and initiating connection pools to all those)
DevCenter will send the request to the nodes in the list you provided
If data is local to these nodes they will take care of the requests. If data is found on the other nodes in the cluster, the nodes used for the connection will act as coordinators (basically they'll relay the requests to the nodes having the data)
Bottom line: there's no risk of data corruption, and the results you get will be exactly the same as when connecting to all the nodes.
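The coordinator behavior can be illustrated with a toy model. This is not how Cassandra or DevCenter is implemented: ToyCluster, its key-to-node mapping and the absence of replication are all simplifications invented for this sketch. It only shows that the node you send a request to does not have to be the node that holds the data.

```python
class ToyCluster:
    """Toy model: each key is owned by exactly one node (no replication)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.store = {node: {} for node in self.nodes}

    def owner(self, key):
        # Stand-in for the partitioner: a deterministic key -> node mapping.
        return self.nodes[sum(key.encode()) % len(self.nodes)]

    def write(self, coordinator, key, value):
        """The coordinator relays the write to the node that owns the key."""
        if coordinator not in self.store:
            raise ValueError(f"unknown coordinator {coordinator}")
        target = self.owner(key)
        self.store[target][key] = value
        return target

    def read(self, coordinator, key):
        """The coordinator fetches the value from the owning node."""
        if coordinator not in self.store:
            raise ValueError(f"unknown coordinator {coordinator}")
        return self.store[self.owner(key)].get(key)
```

With four hosts, a write sent through host1 can land on host3, and a later read through host2 still returns it.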

I can not get Datastax DevCenter to connect to a remote database

When I try to set up a connection, I get the error
Unable to connect to 'Test Cluster': All host(s) tried for query failed
Unexpected error during transport initialization ... (host ip addresses) Channel has been closed
The remote database is on port 9161, which I added on the "Native port protocol" line.
Additionally, it has a username and password, which I also added in the setup. This is all on a 64-bit Windows machine.
I had the same error.
First, open cmd.
Type "cd "
Type "cassandra" to run the Cassandra server.
Then try again in DevCenter with localhost and port 9042.
I hope this helps!
There is another common scenario where this might happen, in case someone stumbles upon this thread in the future.
This type of thing typically happens when the host crashes unexpectedly, resulting in corruption of the sstables or commitlog files.
This is why it is really important to use replication: when you get into this situation, you can run nodetool repair to repair the corrupted tables using data from other nodes.
If you are not fortunate enough to have replication configured, then you are in for some data loss. Clear the suspect file from \data\commitlogs, cry a little and restart the node.

Cassandra big cluster configure the client connection

I've been looking to find how to configure a client to connect to a Cassandra cluster.
Independent of clients like Pelops, Hector, etc, what is the best way to connect to a multi-node Cassandra cluster?
Sending the IP values as strings works fine, but what about a growing number of cluster nodes in the future? Does the client have to keep ALL the cluster nodes' IPs in sync?
I don't know if this answers all your questions, but the growth of the cluster and the client's knowledge of node IPs are not related.
I have a 5-node cluster, but the client(s) only know 2 IP addresses: the seeds. Since each machine in the cluster knows about the seeds (each cassandra.yaml contains the seeds' IP addresses), when a new machine is added, information about it comes "for free" on the client side.
Imagine a 5-node cluster with the following IPs:
192.168.1.1
192.168.1.2 (seed)
192.168.1.3
192.168.1.4 (seed)
192.168.1.5
E.g., when node .5 boots, it contacts the seeds (nodes .2 and .4) and receives back information about the whole cluster. If you add a new node, 192.168.1.6, it will behave exactly like .5 and contact the seeds to learn the cluster state. On the client side you don't have to change anything: you will just see that you now have 6 endpoints instead of 5.
PS: you don't necessarily have to connect to the seeds; you can connect to any node, since after having contacted the seeds each node knows the whole cluster topology.
PPS: it's your choice how many nodes to put in your "client known hosts". You can also put all 5, but this doesn't change the fact that when a node is added you don't need to do anything on the client side.
Regards,
Carlo
You will have an easier time letting the client track the state of each node. Smart clients will track endpoint state via the gossipinfo, which passes on new nodes as they appear in the cluster.
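The idea that a client needs only a couple of contact points can be shown with a toy sketch. Nothing here talks to a real cluster: cluster_view is an invented dict standing in for what each reachable node reports about its peers, and discover_nodes is a hypothetical name.

```python
def discover_nodes(contact_points, cluster_view):
    """Return all nodes learned from the first contact point that answers."""
    for contact in contact_points:
        peers = cluster_view.get(contact)  # None models an unreachable node
        if peers is not None:
            return sorted({contact, *peers})
    raise ConnectionError("no contact point reachable")
```

When a sixth node joins, the peer lists grow and the client sees six endpoints while its own contact-point list stays unchanged.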
