Hazelcast nodes refusing to join each other

I have four VMs. I am running Hazelcast in "embedded" mode along with my application, attempting to use it for the Hibernate L2 cache.
I get mixed behavior when I attempt to start up different groups. I think I'm having issues due to how the machines are subnetted. The machines' /sbin/ifconfig output shows three subnets of 2 machines between the four nodes (node1 and node4 each show two network devices other than loopback).
Mancenter is running on a fifth node.
Machine   Subnet 1     Subnet 2
node1     10.10.40.1   10.10.27.1
node2     10.10.42.1
node3     10.10.40.2
node4     10.10.42.2   10.10.27.2
So node1 and node3 share a subnet, node1 and node4 share one, and node2 and node4 share one.
Behavior is very inconsistent, though node1 and node2 starting together seems to reliably form a cluster, as does node1 and node3. Other combinations seem to enter a split-brain scenario, where it appears I have two or more clusters with the same name.
Querying our internal DNS, the hostnames resolve to the 10.10.40.* and 10.10.42.* IPs.
The nodes have identical configurations. I have tried enabling the interfaces section with 10.10.40.* and 10.10.42.* along with setting hazelcast.socket.bind.any to false. Because of our deployment framework, having identical configurations across a cluster is a high priority.
I have tried listing the members by both hostname and IP (the one resolved by nslookup of the hostname). Listing by hostname is going to be a requirement from operations.
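For reference, the shape of configuration I'm describing is roughly this (a sketch rather than my exact file; the member hostnames are placeholders):

    <hazelcast xmlns="http://www.hazelcast.com/schema/config">
      <properties>
        <!-- bind only to the interfaces listed below, not 0.0.0.0 -->
        <property name="hazelcast.socket.bind.any">false</property>
      </properties>
      <network>
        <interfaces enabled="true">
          <interface>10.10.40.*</interface>
          <interface>10.10.42.*</interface>
        </interfaces>
        <join>
          <multicast enabled="false"/>
          <tcp-ip enabled="true">
            <!-- placeholders; operations requires hostnames rather than IPs -->
            <member>node1</member>
            <member>node2</member>
            <member>node3</member>
            <member>node4</member>
          </tcp-ip>
        </join>
      </network>
    </hazelcast>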
In some situations I have managed to get them to form a cluster, though migration fails because it complains that it cannot reach one of the nodes.
Out of curiosity I have noticed that Mancenter will sometimes identify a node as another; for example, currently I have node3 and node4 running (with node1 and node2's applications shut down) and it is identifying one of them as node2. I am wondering if this has to do with the fact that the nodes are running on VMs (one instance per VM). I believe the host OS is Red Hat and the VMs are running CentOS.
Am I on the wrong track thinking this is the issue? What else could be causing this?

I managed to fix the issue, though I'm not 100% sure what the root cause was.
The version of Hazelcast I was using was 3.8, but I carried over a configuration from 3.7. At a glance the only difference was the schema being changed from "hazelcast-config-3.7.xsd" to "hazelcast-config-3.8.xsd".
I also upgraded Mancenter from the 3.7 version to the 3.8 version, though I'm not sure that would have any effect.
Either way, my four nodes are now up and talking to each other. So if you come across this and are having a similar issue, I'd recommend making sure your configuration version matches your deployed version.
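For anyone checking their own file: the version lives in the schema declaration at the top of hazelcast.xml, so after the upgrade the header should reference the 3.8 XSD, roughly like this:

    <hazelcast xmlns="http://www.hazelcast.com/schema/config"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.hazelcast.com/schema/config
                                   http://www.hazelcast.com/schema/config/hazelcast-config-3.8.xsd">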

Related

Cassandra 3.11.x mixing versions

We have a 6-node Cassandra 3.11.3 cluster on Ubuntu 16.04. These are virtual machines.
We are switching to physical machines, on brand new (8!) servers that will have Debian 11 and presumably Cassandra 3.11.12.
Since the main version is always 3.11.x and Ubuntu 16.04 is out of support, the question is: can we just let the new machines join the old cluster and then decommission the outdated ones?
I hope to get some tips about this because intuitively it seems fine, but we are not too sure about that.
Thank you.
We have a 6-node Cassandra 3.11.3 cluster on Ubuntu 16.04. These are virtual machines. We are switching to physical machines, on brand new (8!) servers ...
Quick tip here: it's a good idea to build your clusters in multiples of your RF. Not sure what your RF is, but if RF=3, I'd either stay with six or get one more and go to nine. It's all about even data distribution.
can we just let the new machines join the old cluster and then decommission the outdated ones?
In short, no. You'll want to upgrade the existing nodes to 3.11.12 first. I can't recall if 3.11.3 and 3.11.12 are SSTable compatible, but I wouldn't risk it.
Secondly, the best way to do this is to build your new (physical) nodes in the cluster as their own logical data center. Start them up empty, and then run a nodetool rebuild on each. Once that's complete, decommission the old nodes.
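Roughly, and assuming the old data center is named DC1 and the new physical one DC2 (the names are placeholders), the sequence looks like this:

    # first, alter the replication of your keyspaces so they also cover DC2
    # then, on each new (empty) node in DC2, once it has joined the cluster:
    nodetool rebuild -- DC1

    # once every DC2 node has finished rebuilding, on each old DC1 node:
    nodetool decommission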
There is a somewhat simpler solution: move the data from each virtual machine onto a physical server, as follows:
1. Prepare the Cassandra installation on a physical machine, configuring the same cluster name, etc.
2. Stop Cassandra in the virtual machine & make sure that it won't start again.
3. Copy all Cassandra data (/var/lib/cassandra or similar) from the VM to the physical server.
4. Start the Cassandra process on the physical server.
Repeat that process for all VM nodes, at some point updating seeds, etc. After the process is finished, you can add the two physical servers that are left. Also, to speed up the process, you can do an initial copy of the data before stopping Cassandra in the VM and, after it's stopped, re-sync the data with rsync or similar. This way you can minimize the downtime.
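To illustrate the copy/re-sync idea (the hostname phys1 and the data path are placeholders, assuming a package install with data in /var/lib/cassandra):

    # initial copy while Cassandra is still running on the VM
    rsync -avz /var/lib/cassandra/ phys1:/var/lib/cassandra/

    # then stop Cassandra on the VM and re-sync only the delta
    sudo systemctl stop cassandra
    rsync -avz --delete /var/lib/cassandra/ phys1:/var/lib/cassandra/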
This approach would be much faster than adding a new node & decommissioning the old one, as we won't need to stream data twice. It works because, after a node is initialized, Cassandra identifies nodes by their assigned UUID, not by IP address.
Another approach is to follow the instructions for replacing a dead node. In this case streaming of data happens only once, but it could be a bit slower than directly copying the data.
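For completeness, dead-node replacement is driven by a JVM flag on the replacement node, along these lines (the address is a placeholder for whatever the old node's IP was):

    # in cassandra-env.sh (or jvm.options) on the replacement physical node
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.1"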

CouchDB 2.1 cluster gives "database failed to load" after being brought up on instances with new IP addresses

I have a CouchDB cluster with 9 shards running on 3 nodes (and n=3). I brought down the three instances the nodes were running on and created 3 new instances with new internal IP addresses. After this I was getting "Database failed to load" errors for the databases.
I started to read about moving shards and I am aware of the change when going from 2.1.0 to 2.1.1 (as answered in this question).
While trying to solve this problem I modified a copy of the data from one of the nodes to try and adjust the _dbs metadata. However, I think I only made a mess of the metadata.
I am thinking that in order to fix this problem I can:
Change the IP address of one of the nodes to match an old IP address.
Start the cluster on just this instance with an original copy of the data.
Tell CouchDB to delete the _membership info about the two non-functioning nodes from the cluster.
Add two new nodes to the cluster with new IP addresses (see the sketch below).
... and then maybe the cluster will sync the shards from the one node to all three nodes and everything will work?
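If I'm reading the cluster docs correctly, steps 3 and 4 would be done against the node-local _nodes database (port 5986 in 2.1.x), something along these lines (node names, credentials, and the _rev value are placeholders):

    # remove a dead node (requires the current _rev of its document)
    curl -X DELETE 'http://admin:pass@127.0.0.1:5986/_nodes/couchdb@10.0.0.2?rev=<current-rev>'

    # add a new node
    curl -X PUT 'http://admin:pass@127.0.0.1:5986/_nodes/couchdb@10.0.0.4' -d '{}'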
I am still trying to figure out how CouchDB 2.1.1 clustering works, so any information beyond the CouchDB docs is appreciated.

What address should I use for listen_address in cassandra.yaml?

I am trying to set up a multi-node Cassandra database on two different machines.
How am I supposed to configure the cassandra.yaml file?
The DataStax documentation says:
listen_address
(Default: localhost) The IP address or hostname that other Cassandra nodes use to connect to this node. If left unset, the hostname must resolve to the IP address of this node using /etc/hostname, /etc/hosts, or DNS. Do not specify 0.0.0.0.
When I use 'localhost' as the value of listen_address, it runs fine on the local machine, and when I use my IP address, it fails to connect. Why so?
Configuring the nodes and seed nodes is fairly simple in Cassandra, but certain steps must be followed. The procedure for setting up a multi-node cluster is well documented, and I will quote from the linked document.
I think it is easier to illustrate the setup with 4 nodes instead of 2, since 2 nodes would make little sense for a running Cassandra cluster. If you had 4 nodes split between 2 machines, with 1 seed node on each machine, the conceptual configuration would appear as follows:
node1 86.82.155.1 (seed 1)
node2 86.82.155.2
node3 192.82.156.1 (seed 2)
node4 192.82.156.2
If each of these machines is the same in terms of layout, you can use the same cassandra.yaml file across all nodes.
If the nodes in the cluster are identical in terms of disk layout, shared libraries, and so on, you can use the same copy of the cassandra.yaml file on all of them.
You will need to set the IP address up under the -seeds configuration in cassandra.yaml.
-seeds: internal IP address of each seed node
parameters:
- seeds: "86.82.155.1,192.82.156.1"
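Putting that together, a minimal cassandra.yaml sketch for, say, node2 above might look like this ('MyCluster' is a placeholder name; leave listen_address unset if you want a single identical file and rely on hostname resolution, per the documentation quoted above):

    cluster_name: 'MyCluster'
    listen_address: 86.82.155.2          # this node's own IP; never 0.0.0.0
    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          - seeds: "86.82.155.1,192.82.156.1"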
Understanding the difference between a node and a seed node is important. If you get these IP addresses crossed you may experience issues similar to what you are describing, and from your comment it appears you have corrected the configuration.
Seed nodes do not bootstrap, which is the process of a new node joining an existing cluster. For new clusters, the bootstrap process on seed nodes is skipped.
If you are having trouble grasping the node-based architecture, read the Architecture in Brief document or watch the Understanding Core Concepts class.

DataStax community AMI installation doesn't join other nodes

I kicked off a 6-node cluster as per the documentation on http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installAMILaunch.html. All worked OK. It's meant to be a 6-node cluster - I can see the 6 nodes working on the EC2 dashboard. I can see OpsCenter working on node 0. But the nodes are not seeing each other... I don't have access to OpsCenter via browser, but I can ssh to each node and verify Cassandra is working.
What do I need to do so that they join the cluster? Note they are all in the same VPC and the same subnet, in the same IP range, with the same cluster name. All were launched using the AMI specified in the document.
Any help will be much appreciated.
I hope your listen address is configured. Add the auto_bootstrap: false attribute to each node and restart each node. Check the logs too; that would be of great help.
In my situation, setting the broadcast address to the public IP caused a similar issue. Make the broadcast address your private IP, or just leave it untouched. If a public broadcast address is a must-have, have your architect modify the firewall rules.
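Concretely, the relevant cassandra.yaml lines on an EC2 node would be something like this (10.0.0.5 stands in for the instance's private IP):

    listen_address: 10.0.0.5       # the instance's private IP
    broadcast_address: 10.0.0.5    # keep this private (or leave it unset) unless firewall rules allow public traffic
    # auto_bootstrap: false        # as suggested above, then restart the node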

Apache Cassandra node not joining cluster ring

I have a four-node Apache Cassandra Community 1.2 cluster in a single datacenter with one seed.
All configurations in the cassandra.yaml files are similar.
I am facing the following issues; please help.
1] Though the fourth node isn't listed in the nodetool ring or status output, system.log shows only that this node isn't communicating with the other nodes via the gossip protocol.
However, both the JMX & Telnet ports are enabled, with the proper listen/seed addresses configured.
2] Though OpsCenter is able to recognize all four nodes, the agents are not getting installed from OpsCenter.
However, the same JVM version is installed and JAVA_HOME is set on all four nodes.
I further observed that the problematic node runs 64-bit Ubuntu while the other nodes run 32-bit Ubuntu; could that be the reason?
What version of Cassandra are you using? I had reported a similar kind of bug in Cassandra 1.2.4 and was told to move to subsequent versions.
Are you using GossipingPropertyFileSnitch? If that's the case, your problem should have been solved by keeping the cassandra-topology.properties files up to date.
If all these are fine, check your TCP-level connections via netstat and tcpdump. If the connections are getting dropped at the application layer, then consider a rolling restart.
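For example, on the problematic node you could check the inter-node (gossip/storage) port, which is 7000 by default (the interface name is a placeholder):

    # is the storage port established to the other nodes?
    netstat -an | grep 7000

    # are gossip packets actually arriving?
    sudo tcpdump -i eth0 port 7000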
Your statement is actually very raw; my assumption is that your server-level configuration might be wrong.
I would suggest checking that cassandra-topology.properties and cassandra-rackdc.properties are consistent across all nodes.
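As a reminder of the format, cassandra-topology.properties maps node IPs to a data center and rack, for example (the IPs, DC, and rack names are placeholders):

    # ip=DC:rack
    192.168.1.1=DC1:RAC1
    192.168.1.2=DC1:RAC2
    default=DC1:RAC1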
