apache cassandra node not joining cluster ring - cassandra

I've a four node apache cassandra community 1.2 cluster in single datacenter with a seed.
All configurations are similar in cassandra.yaml file.
The following issues are faced, please help.
1] Though fourth node isn't listed in nodetool ring or status command, system.log displayed only this node isn't communicating via gossip protoccol with other nodes.
However both jmx & telnet port is enabled with proper listen/seed address configured.
2] Though Opscenter is able to recognize all four nodes, the agents are not getting installed from opscenter.
However same JVM version is installed as well as JAVA_HOME is also set in all four nodes.
Further observed that problematic node has Ubuntu 64-Bit & other nodes are Ubuntu 32-Bit, can it be the reason?

What is the version of cassandra you are using. I had reported a similar kind of bug in cassandra 1.2.4 and it was told to move to subsequent versions.
Are you using gossiping property file snitch? If that's the case, your problem should have been solved by having updated cassandra-topology.properties files that are upto date.
If all these are fine, check your TCP level connection via netstat and TCP dump.If the connections are getting dropped at application layer, then consider a rolling restart.
You statement is actually very raw. Your server level configuration might be wrong in my assumption.
I would suggest you to check if cassandra-topology.properties and cassandra-racked.properties across all nodes are consistent.

Related

Unable to start DSE using SPARK_ENABLED=1

We are running 6 node cluster with:
HADOOP_ENABLED=0
SOLR_ENABLED=0
SPARK_ENABLED=0
CFS_ENABLED=0
Now, we would like to add Spark to all of them. It seems like "adding" is not the right term because this would not fail. Anyways, the steps we've done:
1. drained one of the nodes
2. changed /etc/default/dse to SPARK_ENABLED=1 and HADOOP_ENABLED=0
3. sudo service dse restart
And got the following in the log:
ERROR [main] 2016-05-17 11:51:12,739 CassandraDaemon.java:294 - Fatal exception during initialization
org.apache.cassandra.exceptions.ConfigurationException: Cannot start node if snitch's data center (Analytics) differs from previous data center (Cassandra). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
There are two related questions that have been already answered:
Unable to start solr aspect of DSE search
Two node DSE spark cluster error setting up second node. Why?
Unfortunately, clearing the data on the node is not an option - why would I do that? I need the data to be intact.
Using "-Dcassandra.ignore_rack=true -Dcassandra.ignore_dc=true" is a bit scary in production. I don't understand why DSE wants to create another DC and why can't it just use the existing one?
I know that according to datastax's doc one should partition the load using different DC for different workloads. In our case we just want to run SPARK jobs on the same nodes that Cassandra is running using the same DC.
Is that possible?
Thanks!
The other answers are correct. The issue here is trying to warn you that you have previously identified this node as being in another DC. This means that it probably doesn't have the right data for any key-spaces with Network Topology Strategy. For example if you had a NTS keyspace which had only one replica in "Cassandra" and changed the DC to "Analytics" you could inadvertently lose all of the data.
This warning and the accompanying flag are telling you that you are doing something that you should not be doing in a production cluster.
The real solution to this is to explicitly name your dc's using GossipingFileSnitch and not rely on SimpleSnitch which names based on the DSE workload.
In this case, switch to GPFS and set the DC name to Cassandra.

Cassandra multinode cluster setup issue

I am trying to setup a Cassandra multinode cluster on CentOS 7 with OpenJDK.
I have 2 Nodes:
node1 10.99.189.49
node2 10.99.189.50
I have done following things till now:
Downloaded the tarball of Cassandra from PlanetCassandra site
Extracted it in Documents folder.
Created all the necessary directories (data/saved_cache, data/commitlog, data/data) as mentioned in the YAML file.
And I have made 3 changes in my conf/cassandra.yaml file as follows:
On node 10.99.189.49:
seeds: "10.99.189.49"
listen_address: 10.99.189.49
rpc_address: 10.99.189.49
On node 10.99.189.50:
seeds: "10.99.189.49"
listen_address: 10.99.189.50
rpc_address: 10.99.189.50
Now I run cassandra on node 10.99.189.49
and then I run cassandra on the other node.
Cassandra starts normally on both the nodes
BUT
when I do:
bin/nodetool status
I can see only one node in it.
Can anyone point what I am doing wrong or missing something?
So I started adding tips in the comments, and for my 3rd time around I thought I'd start putting them all together in an actual answer.
DataStax does a pretty good job documenting how this should work. Make sure that you've gone through these docs (specifically the first one) and that you're following all the steps:
Initializing a multiple node cluster (single data center)
Adding nodes to an existing cluster
In addition to everything you have mentioned above, make sure that the cluster_name is the same on each node.
I find it easier to make this work using the GossipingPropertyFileSnitch. Set that in your cassandra.yaml on each node:
endpoint_snitch: GossipingPropertyFileSnitch
Then make sure that each of your nodes is specifying the same default data center in the cassandra-rackdc.properties file:
dc=DC1
Get your first node (.49) up-and-running. Verify it with nodetool status.
Also verify that you have opened the necessary ports in your firewall. From .49, try telneting your way to the other node on the ports that Cassandra requires. I recommend 7000, as that is the port for non-SSL inter-node communication.
telnet 10.99.189.50 7000
Once you're sure all that works and everything is configured properly, then bring up .50. I remember reading that you should wait at least 2 minutes before bringing up another node, so do that just to be on the safe side. Tail the logs to make sure it handshakes with the other node, or to see any errors:
tail -f /var/log/cassandra/system.log
Notes: Your log location may vary. I'm assuming you're running 2.2. If you are using a different version of Cassandra, please indicate it.
Hope this helps!
On both node use
seeds: "10.99.189.49,10.99.189.50"
and also restart both node cassandra

OpsCenter is having trouble connecting with the agents

I am trying to setup a two node cluster in Cassandra. I am able to get my nodes to connect fine as far as I can tell. When I run nodetool status it shows both my nodes in the same data center and same rack. I can also run cqlsh on either node and query data. The second node can see data from the first node, etc.
I have my first node as the seed node both in the Cassandra.yaml and the cluster config file.
To avoid any potential security issues, I flushed my iptable and allowed all on all ports for both nodes. They are also on the same virtual network.
iptables -P INPUT ACCEPT
When I start OpsCenter on either machine it sees both nodes but only has information on the node I am viewing OpsCenter on. It can tell if the other node is up/down, but I am not able to view any detailed information. It sometimes initially says 2 Agents Connected but after awhile it says 1 agent failed to connect. It keeps prompting me to install OpsCenter on the other node although it's already there.
The OpsCenterd.log doesn't reveal much. There don't appear to be any errors but I see INFO: Nodes with agents that appear to no longer be running .
I am not sure what else to check as everything but OpsCenter seems to be working fine.
You should install Opscenter on a single node rather than all nodes. The opscenter gui will then prompt you to install the agent on each of the nodes in the cluster. Use nodetool status or nodetool ring to make sure that the cluster is properly functioning and all nodes are Up and functioning Normally. (status = UN)
In address.yaml file you can set stomp_address equal to the ip address of the opscenter server to force the agents to the correct address.

Node actions apply to only 1 node

I just installed a 3 node cassandra (2.0.11) community cluster with a single seed node. I installed opscenter (5.0.2) on the seed node and everything is working fairly well. The only issue I am having is that any node actions I perform (stop, start, compact, etc) apply only to the seed node. Even if I choose a different node on the ring or list, the action always happens on the seed node.
I watched the opscenter logs and can see requests for /ops/compact/ip_address and the ip address is the correct node that I chose but the action always run on the seed instance.
All agents have been installed on all the nodes and the cluster is fully operational. I can run nodetool compact on each node and see the compaction progress in opscenter.
I have each node configured to listen on an internal address and have verified that the rpc server is open on the network. I have also tried adding the cluster using a non-seed node but all actions continue to run on the seed node.
Posted the answer above but I'll explain more in detail for anyone else with this issue.
I changed rpc_address and listen_address in cassandra.yaml in order to listen on a private ip address. I restarted cassandra and the cluster could communicate easily. The datastax-agent was still reporting 127.0.0.1 to opscenter as the rpc address. I found this out by enabling trace logging in opscenter.
If you modify anything in cassandra.yaml, make sure you restart the datastax-agent as it apparently caches the data.

Cassandra hangs on arbitrary commands

We're hosting Cassandra 2.0.2 cluster on AWS. We've recently started upgrading from normal to SSD drives, by bootstrapping new and decommissioning old nodes. It went fairly well, aside from two nodes hanging forever on decommission. Now, after the new 6 nodes are operational, we noticed that some of our old tools, using phpcassa stopped working. Nothing has changed with security groups, all ports TCP/UDP are open, telnet can connect via 9160, cqlsh can 'use' a cluster, select data, however, 'describe cluster' fails, in cli, 'show keyspaces' also fails - and by fail, I mean never exits back to prompt, nor returns any results. The queries work perfectly from the new nodes, but even the old nodes waiting to be decommissioned cannot perform them. The production system, also using phpcassa, does normal data requests - it works fine.
All cassandras have the same config, the same versions, the same package they were installed from. All nodes were recently restarted, due to seed node change.
Versions:
Connected to ### at ####.compute-1.amazonaws.com:9160.
[cqlsh 4.1.0 | Cassandra 2.0.2 | CQL spec 3.1.1 | Thrift protocol 19.38.0]
I've run out out of ideas. Any hints would be greatly appreciated.
Update:
After a bit of random investigating, here's a bit more detailed description.
If I cassandra-cli to any machine, and do "show keyspaces", it works.
If I cassandra-cli to a remote machine, and do "show keyspaces", it hangs indefinitely.
If I cqlsh to a remote cassandra, and do a describe keyspaces, it hangs. ctrl+c, repeat the same query, it instantly responds.
If I cqlsh to a local cassandra, and do a describe keyspaces, it works.
If I cqlsh to a local cassandra, and do a select * from Keyspace limit x, it will return data up to a certain limit. I was able to return data with limit 760, the 761 would fail.
If I do a consistency all, and select the same, it hangs.
If I do a trace, different machines return the data, though sometimes source_elapsed is "null"
Not to forget, applications querying the cluster sometimes do get results, after several attempts.
Update 2
Further playing introduced failed bootstrapping of two nodes, one hanging on bootstrap for 4 days, and eventually failing, possibly due to a rolling restart, and the other plain failing after 2 days. Repairs wouldn't function, and introduced "Stream failed" errors, as well as "Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.NullPointerException". Also, after executing repair, started getting "Read an invalid frame size of 0. Are you using tframedtransport on the client side?", so..
Solution
Switch rpc_server_type from hsha to sync. All problems gone. We worked with hsha for months without issues.
If someone also stumbles here:
http://planetcassandra.org/blog/post/hsha-thrift-server-corruption-cassandra-2-0-2-5/
In cassandra.yaml:
Switch rpc_server_type from hsha to sync.

Resources