Unable to start DSE using SPARK_ENABLED=1 - apache-spark

We are running a 6-node cluster with:
HADOOP_ENABLED=0
SOLR_ENABLED=0
SPARK_ENABLED=0
CFS_ENABLED=0
Now we would like to add Spark to all of them. ("Adding" may not be the right term, because if it were that simple it would not fail.) Anyway, the steps we've done:
1. drained one of the nodes
2. changed /etc/default/dse to SPARK_ENABLED=1 and HADOOP_ENABLED=0
3. sudo service dse restart
And got the following in the log:
ERROR [main] 2016-05-17 11:51:12,739 CassandraDaemon.java:294 - Fatal exception during initialization
org.apache.cassandra.exceptions.ConfigurationException: Cannot start node if snitch's data center (Analytics) differs from previous data center (Cassandra). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
There are two related questions that have already been answered:
Unable to start solr aspect of DSE search
Two node DSE spark cluster error setting up second node. Why?
Unfortunately, clearing the data on the node is not an option - why would I do that? I need the data to be intact.
Using "-Dcassandra.ignore_rack=true -Dcassandra.ignore_dc=true" is a bit scary in production. I don't understand why DSE wants to create another DC and why can't it just use the existing one?
I know that according to DataStax's docs one should partition the load using a different DC for each workload. In our case we just want to run Spark jobs on the same nodes Cassandra is running on, using the same DC.
Is that possible?
Thanks!

The other answers are correct. The error here is trying to warn you that you have previously identified this node as being in another DC. This means that it probably doesn't have the right data for any keyspaces using NetworkTopologyStrategy. For example, if you had an NTS keyspace with only one replica in "Cassandra" and changed the DC to "Analytics", you could inadvertently lose all of the data.
This warning and the accompanying flag are telling you that you are doing something that you should not be doing in a production cluster.
The real solution to this is to explicitly name your DCs using GossipingPropertyFileSnitch and not rely on DseSimpleSnitch, which names the DC based on the DSE workload.
In this case, switch to GossipingPropertyFileSnitch (GPFS) and set the DC name to Cassandra.
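As a rough sketch (file locations and the exact place the snitch is configured vary by DSE version and install method, so treat this as an outline, not the literal procedure), the per-node change looks something like this:
# cassandra-rackdc.properties -- same dc value on every node in this DC
dc=Cassandra
rack=rack1        # rack1 is a placeholder; use your real rack layout
# cassandra.yaml (or the delegated snitch in dse.yaml, depending on DSE version)
endpoint_snitch: GossipingPropertyFileSnitch
Restart one node at a time and confirm with nodetool status that each node still reports data center Cassandra before moving on to the next.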

Related

Nodetool rebuild not working reliably on Cassandra 3.11.3

I presently have a Cassandra 3.11.3 cluster with a single DC. I recently added another DC to my cluster, following the instructions at
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html
As per the instructions I ran 'nodetool rebuild -ks -- dc1' on each of the nodes. However, the rebuild command did not actually work as intended: data is partially missing on the new nodes. I know this because I sampled the data in the new DC through my app using consistency LOCAL_ONE. I don't see the data replenished through read repair either. I should also mention that there were no errors in the logs following the rebuild command, so everything appeared to have succeeded.
What am I missing here? Is there a known issue reported on this?
You should run 'nodetool rebuild -- <existing DC>' on each node. This command will pull all keyspace data from the existing data center based on the allocated tokens and RF. To ensure consistency, run a full repair on the nodes as well.
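A minimal sketch of that sequence, assuming the pre-existing data center is named dc1 (substitute your real DC name), run on each node in the new DC:
nodetool rebuild -- dc1    # stream this node's token ranges from the existing DC
# once rebuild has finished on every new node:
nodetool repair --full     # full repair to catch anything written before rebuild completed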

Way to determine healthy Cassandra cluster?

I've been tasked with re-writing some sub-par Ansible playbooks to stand up a Cassandra cluster in CentOS. Quite frankly, there doesn't seem to be much information on Cassandra out there.
I've managed to get the service running on all three nodes at the same time, using the following configuration file, info scrubbed.
HOSTIP=10.0.0.1
MSIP=10.10.10.10
ADMIN_EMAIL=my@email.com
LICENSE_FILE=/tmp/license.conf
USE_LDAP_REMOTE_HOST=n
ENABLE_AX=y
MP_POD=gateway
REGION=test-1
USE_ZK_CLUSTER=y
ZK_HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"
ZK_CLIENT_HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"
USE_CASS_CLUSTER=y
CASS_HOSTS="10.0.0.1:1,1 10.0.0.2:1,1 10.0.0.3:1,1"
CASS_USERNAME=test
CASS_PASSWORD=test
The HOSTIP changes depending on which node the configuration file is on.
The problem is, when I run nodetool ring, each node says there are only two nodes in the cluster: itself and one other, seemingly chosen at random from the other two.
What are some basic sanity checks to determine a "healthy" Cassandra cluster? Why is nodetool saying each one thinks there's a different node missing from the cluster?
nodetool status - overview of the cluster (load, state, ownership)
nodetool info - more granular details at the node-level
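For a quick first pass, something like this (what to look for is noted in the comments; exact output columns vary slightly between versions):
nodetool status   # every node should be UN (Up/Normal); DN or UJ needs investigation
nodetool info     # per-node view: uptime, heap usage, and whether gossip and the native transport are active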
As for the node mismatch, I would check the following (a scripted version of these checks is sketched after the list):
cassandra-topology.properties - identical across the cluster (all 3 IPs listed)
cassandra.yaml - I typically keep this file the same across all nodes. The parameters that MUST stay the same across the cluster are: cluster_name, seeds, partitioner, and snitch.
verify all nodes can reach each other (ping, telnet, etc)
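A hedged sketch of scripting those checks over SSH, assuming the three IPs from the question, passwordless SSH, and a package install with cassandra.yaml under /etc/cassandra/ (adjust paths for your layout):
for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
  echo "== $host =="
  # the settings that must match on every node
  ssh "$host" 'grep -E "^(cluster_name|endpoint_snitch|partitioner):" /etc/cassandra/cassandra.yaml'
  ssh "$host" 'grep -e "- seeds:" /etc/cassandra/cassandra.yaml'
  # basic reachability on the inter-node storage port (7000 by default)
  nc -zv "$host" 7000
done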
DataStax (the Cassandra vendor) has some good documentation. Please note that some features are only available in DataStax Enterprise -
http://docs.datastax.com/en/landing_page/doc/landing_page/current.html
Also check out the Apache Cassandra site -
http://cassandra.apache.org/community/
As well as the user forums -
https://www.mail-archive.com/user@cassandra.apache.org/
Actually, the thing you really want to check is whether all the nodes "AGREE" on the schema version. nodetool status shows whether nodes are up, down, or joining, yet that does not really mean the cluster is 'healthy' enough to make schema changes or other changes.
The simplest way is:
nodetool describecluster
Cluster Information:
    Name: FooBarCluster
    Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
    DynamicEndPointSnitch: enabled
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        43fe9177-382c-327e-904a-c8353a9df590: [10.136.2.1, 10.136.2.2, 10.136.2.3]
If schema IDs do not match, you need to wait for the schema to settle, or run repairs. A mismatch looks like this, for example:
43fe9177-382c-327e-904a-c8353a9df590: [10.136.2.1, 10.136.2.2]
43fe9177-382c-327e-904a-c8353a9dxxxx: [10.136.2.3]
However, running nodetool is 'heavy' and its output is hard to parse.
The same information is in the database itself; you can check it with:
'SELECT schema_version, release_version FROM system.local' and
'SELECT peer, schema_version, release_version FROM system.peers'
Then you compare schema_version across all nodes... if they match, the cluster is very likely healthy. You should ALWAYS check this before making any changes to schema.
Now, during a rolling upgrade, when changing engine versions, release_version differs between nodes, so to support automated rolling upgrades you need to check that schema versions match within each release_version separately.
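A small sketch of that check from the shell, assuming cqlsh can reach one of the nodes (10.136.2.1 is just the example IP from the output above):
# the coordinator's own view plus what it knows about its peers
cqlsh 10.136.2.1 -e "SELECT schema_version, release_version FROM system.local; SELECT peer, schema_version, release_version FROM system.peers;"
# healthy: one schema_version shared by the local node and every peer
# (during a rolling upgrade, compare only within nodes on the same release_version)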
I'm not sure of all the problems you might be having, but...
Check the cassandra.yaml file. You need a minimum of 3 things to be the same: the seeds list (but do not list all nodes as seeds!), cluster_name, and the snitch. Make sure your listen_address is correct.
If you are using GossipingPropertyFileSnitch, then check the cassandra-topology.properties and/or cassandra-rackdc.properties files for accuracy.
Don't start all the nodes at the same time. Start the seed nodes first - the other nodes will "gossip" with the seed nodes to learn the cluster topology. Shut down the seed nodes last.
Don't use shared storage. That defeats the purpose of distributed data and is considered a cassandra anti-pattern.
If you're in AWS, don't use auto-scaling groups unless you know what you're doing.
Once you've done all that, use nodetool status, nodetool ring, nodetool info, or JMX to see what the cluster is doing.
DataStax does have decent documentation for Cassandra.

Temporarily change multi node to single node

I have configured Cassandra 3.0.9 on 3 nodes, but I have to use only 1 node for some time. I have disconnected the other 2 nodes from the network and also removed the entries for both nodes from cassandra.yaml and the rackdc and topology files.
When I check nodetool status, it shows both of the down nodes. When I try to execute any query in cqlsh, it gives me the error below:
OperationTimedOut: errors={'127.0.0.1': 'Request timed out while waiting for schema agreement. See Session.execute_async and Cluster.max_schema_agreement_wait.'}, last_host=127.0.0.1
Warning: schema version mismatch detected; check the schema versions of your nodes in system.local and system.peers.
How I can resolve this?
That's not how you remove a node from a Cassandra cluster. In fact, what you're doing is quite dangerous. Typically, you'd use nodetool decommission. If your other two nodes are still intact and just offline, I suggest bringing them back online temporarily and let decommission do its thing.
I'm going to also throw this out there - it's possible you're missing a good portion of your data with the steps you did above unless all keyspaces had RF=3. Cassandra distributes data evenly between the nodes in a respective DC. The decommission step I mention above redistributes the data.
Now, if you can't bring the other 2 nodes back to run nodetool decommission, you may have to remove them with nodetool removenode and, in the worst case, nodetool assassinate.
Check these docs for reference and the full steps to removing a node: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddingRemovingNodeTOC.html
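A sketch of the command order described above (the host ID placeholder comes from the nodetool status output; nothing here is specific to your cluster):
# preferred: run on each node being retired while it is still up;
# it streams its data to the remaining replicas before leaving the ring
nodetool decommission
# if a node is permanently gone and cannot be brought back online,
# run this from any live node, using the dead node's Host ID from nodetool status
nodetool removenode <host-id>
# last resort only, when removenode hangs; does NOT re-replicate data
nodetool assassinate <ip-of-dead-node>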

Cassandra new node bootstrapping while it should not

I'm adding a new datacenter to an existing cluster and I'm following this "procedure".
However, the first node I start is apparently bootstrapping: the load reported by nodetool status keeps growing...
I added
auto_boostrap: false
in cassandra.yaml.
Am I missing something?
By adding auto_bootstrap: false, you tell the node not to bootstrap - that DOESN'T tell it not to take any writes. What are the replication settings for the new datacenter? Did you already enable it in the various keyspaces? If so, it will participate in writes.
When you say you see the load increasing, is it streaming? Do you see files being transferred in nodetool netstats? Is the node Up, Normal or Up, Joining?
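Two quick ways to answer those questions, sketched with placeholder names (my_keyspace and DC2 are examples, not values from the question); run them on any node:
# does this keyspace already replicate to the new DC?
cqlsh -e "DESCRIBE KEYSPACE my_keyspace;"   # look for the new DC name in the replication map
# is the node actually streaming, or just receiving writes?
nodetool netstats    # active streaming sessions are listed here
nodetool status      # UJ = joining/bootstrapping, UN = normal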
Check your cassandra-rackdc.properties and don't forget to change the DC name (dc=DC1) in that file if you are creating a new datacenter; otherwise the new node will be considered part of the same datacenter. An example is sketched below.
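For example, a node meant to join a new data center named DC2 would carry something like this in cassandra-rackdc.properties (DC2 and RAC1 are placeholders):
dc=DC2
rack=RAC1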

Cassandra ring configuration

I am trying to connect Apache Cassandra nodes into a ring. They are not DataStax distributions, but Cassandra 1.2.8 from the Apache website. When trying to add one as the seed of the other, I get the following exception:
Unable to find compaction strategy class 'com.datastax.bdp.hadoop.cfs.compaction.CFSCompactionStrategy'
Before that I changed the "listen_address" and "rpc_address" to the local IP address of each node. Next I added one node's IP as a seed on the other node. The nodes start up and the exception is printed, but both nodes run fine until restarted. After restarting either node, the exception is printed and the nodes do not run.
This is very strange - I do not have any DSE components.
Did you previously use any DSE components? If you did and are using the same data directory on any of your nodes, Cassandra may find old column families that were created with this compaction strategy. If there is no data you want to keep in the data directories on any of your nodes, you can clear them by stopping all nodes, deleting the directories, then starting the nodes.
Or, if you have any DSE nodes still up, they may be joining the new cluster and propagating their schema, thereby creating column families with this compaction strategy. You can find out by looking in the logs to see which nodes try to connect. If any aren't from your 1.2.8 ring, then this is probably the cause.
That error means you either had a DSE Analytics node in your ring at some point, or you restored your schema from someplace that had an Analytics node.
I would check whether you have the folder /etc/dse/ on your VM; that would mean DSE was installed there.
To just wipe the node and start from scratch schema-wise, you can stop the node, remove the system/schema_* folders under the data directory, then start the node. When it starts it will have no schema. Re-create any keyspaces/column families you had before, and the existing data will be read from disk.
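A sketch of that reset, assuming a package install with the default data directory /var/lib/cassandra/data and a service script (adjust paths and service name for your setup):
sudo service cassandra stop
# remove only the locally stored schema tables (Cassandra 1.2 keeps them here)
sudo rm -rf /var/lib/cassandra/data/system/schema_*
sudo service cassandra start    # the node comes back with an empty schema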
