DSE Analytics in Cluster Configuration - apache-spark

Previously we had three nodes cluster with two Cassandra nodes datacenter in one dc and one spark enabled node in different dc.
Spark was running smoothly in that configurations.
Then we tried adding another node in analytics dc with spark enabled. We had configured GossipingPropertyFileSnitch as well as added seeds.
But now when we start the cluster, spark master is assigned to both the nodes separately. So spark job still runs on a single node. What configurations are we missing regarding running spark job in a cluster?

Most probably you didn't make an adjustments in the Analytics keyspace replication, or didn't run the repair after you added a node. Please refer to instructions in official documentation.
Also, please check that you configured the same DC for both of Analytics nodes, because the Spark master is elected per DC.

Related

Is it possible to backup a 6-node DataStax Enterprise cluster and restore it to a new 4-node cluster?

I have this case. We have 6 nodes DSE cluster and the task is to back it up, and restore all the keyspaces, tables and data into a new cluster. But this new cluster has only 4 nodes.
Is it possible to do this?
Yes, it is definitely possible to do this. This operation is more commonly referred to as "cloning" -- you are copying the data from one DataStax Enterprise (DSE) cluster to another.
There is a Cassandra utility called sstableloader which reads the SSTables and loads it to a cluster even when the destination cluster's topology is not identical to the source.
I have previously documented the procedure in How to migrate data in tables to a new Cassandra cluster which is also applicable to DSE clusters. Cheers!

Migrate Datastax Enterprise Cassandra to Apache Cassandra

We have currently using DSE 4.8 and 5.12. we want to migrate to apache cassandra .since we don't use spark or search thought save some bucks moving to apache. can this be achieved without down time. i see sstableloader works other way. can any one share me the steps to follow to migrate from dse to apache cassandra. something like this from dse to apache.
https://support.datastax.com/hc/en-us/articles/204226209-Clarification-for-the-use-of-SSTABLELOADER
Figure out what version of Apache Cassandra is being run by DSE. Based on the DSE documentation DSE 4.8.14 is using Apache Cassandra 2.1 and DSE 5.1 is using Apache Cassandra 3.11
Simplest way to do this is to build another DC (Logical DC per Cassandra) and add it to the existing cluster.
As usual, with a "Nodetool Rebuild {from-old-DC}" on to the new DC nodes, let Cassandra take care of streaming data to the new Apache Cassandra nodes naturally.
Once data streaming is completed, based on the LoadBalancingPolicy being used by applications, switch their local_dc to DC2 (the new DC). Once the new DC starts taking traffic, shutdown nodes in old DC say DC1 one by one.
alter keyspace dse_system and dse_security not using everywhere
on non-seed nodes, cleanup cassandra data directory
turn on replace in cassandra-env.sh
start instance
monitoring streaming process using command 'nodetool netstats|grep Receiving'
change seeds node definition and rolling restart before finally migrate previous seeds nodes.

How to setup Spark with a multi node Cassandra cluster?

First of all, I am not using the DSE Cassandra. I am building this on my own and using Microsoft Azure to host the servers.
I have a 2-node Cassandra cluster, I've managed to set up Spark on a single node but I couldn't find any online resources about setting it up on a multi-node cluster.
This is not a duplicate of how to setup spark Cassandra multi node cluster?
To set it up on a single node, I've followed this tutorial "Setup Spark with Cassandra Connector".
You have two high level tasks here:
setup Spark (single node or cluster);
setup Cassandra (single node or cluster);
This tasks are different and not related (if we are not talking about data locality).
How to setup Spark in Cluster you can find here Architecture overview.
Generally there are two types (standalone, where you setup Spark on hosts directly, or using tasks schedulers (Yarn, Mesos)), you should draw upon your requirements.
As you built all by yourself, I suppose you will use Standalone installation. The difference between one node is network communication. By default Spark runs on localhost, more commonly it uses FQDNS name, so you should configure it in /etc/hosts and hostname -f or try IPs.
Take a look at this page, which contains all necessary ports for nodes communication. All ports should be open and available between nodes.
Be attentive that by default Spark uses TorrentBroadcastFactory with random ports.
For Cassandra see this docs: 1, 2, tutorials 3, etc.
You will need 4 likely. You also could use Cassandra inside Mesos using docker containers.
p.s. If data locality it is your case you should come up with something yours, because nor Mesos, nor Yarn don't handle running spark jobs for partitioned data closer to Cassandra partitions.

DSE changing cassandra to use vnodes

We want to deploy DSE cluster of 3 nodes, where each node is Analytics running Spark.
We want to use vnodes in cassandra, because it enables much more even data distribution and easier adding of the nodes. We deploy DSE on AWS, using one of the available AMI images.
Although DSE by default deploys Cassandra cluster using single token nodes, we have to manually change cassandra.yaml file on all the nodes.
According to datastax documentation, I should:
uncomment num_tokens field (I left 256 default value)
leave initial_token field unassigned
After that, when I do nodetool status command, I see that my cluster still uses single token mode.
According to this, I should restart nodes in the cluster, so that changes take effect.
But after nodes are restarted both thru OPS center or AWS console, I get errors, nodes are in unresponsive state, and I cannot use nodetool command on my nodes, with error:
Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused'.
Is there something that I am doing wrong?
How to enable vnodes on DSE when deployed using AMI image?
Thank you

Cassandra clusters separation

I have one single-node cluster and just added a multi-node cluster (on 4 seprate nodes, let's call them node1, node2,.., node4). The single-node cluster uses the localhost as seed_provider. The multi-node uses node1,node2 hosts as seeds (SimpleSeedProvider).
To my suprise when I started the multi-node cluster I was able to see they started talking to the single-node Cassandra and they downloaded data from it.
How to prevent the new cluster talking to the existing cluster? Do I miss anything else?
They will "gossip" on the network and detect each other if they are not separated.
Did you make sure your cluster_name value in your cassandra.yaml file is not the same for both of your clusters? That's how they differentiate each other as said in the sample configuration file :
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.

Resources