Adding a new keyspace to an existing production Cassandra cluster - cassandra

I" have an existing cassandra cluster running in AWS. It has total 6 nodes in the same data center but in multiple regions. We are using cassandra version 2.2.8 in production. There are two existing keyspaces already present in the production environment. I want to add a new keyspace to the production cluster.
I am new to Cassandra, so I am looking for answers to the following:
Can I add a new keyspace to the existing production cluster without taking the cluster down?
Are there any best practices you would recommend for adding the new keyspace to the existing cluster?
What are the possible steps to add the new keyspace?
I really appreciate your help!

Yes, you can add keyspaces online.
When you add a keyspace, you have to choose the replication factor. Since you run across multiple AWS regions, you are probably using Ec2MultiRegionSnitch as your endpoint_snitch, right?
If so, you probably configured dc_suffix=_XYZ, and your DCs now look like "us-east_XYZ" (see nodetool status).
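For reference, with Ec2MultiRegionSnitch that suffix is read from cassandra-rackdc.properties on each node (the _XYZ value below is just the placeholder from this example):
dc_suffix=_XYZ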
Then, you can use something like this:
CREATE KEYSPACE my_keyspace
WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy', 'us-east_XYZ' : 2, 'us-west_XYZ' : 2 }
AND DURABLE_WRITES = true;
See docs: CREATE KEYSPACE
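To double-check the result, you can dump the keyspace definition back out of the cluster with cqlsh (using the keyspace name from the example above):
cqlsh -e 'DESC KEYSPACE my_keyspace;'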

Related

Cassandra Replication not working across data centers

I am new to Cassandra and have configured a Cassandra cluster across multiple AWS data centers.
I have 3 replicas in eu-central-1 and 3 replicas in eu-west-1.
I created the keyspace from a eu-central-1 seed as follows: CREATE KEYSPACE my_test WITH REPLICATION = {'class':'NetworkTopologyStrategy', 'eu-west-1':'3', 'eu-central-1':'3'}; and after that I created several tables under this keyspace.
The keyspace and tables did not replicate to the 3 replicas in eu-west-1. Should the keyspace and tables be replicated to the eu-west-1 seeds automatically? If yes, what is wrong with my configuration?
Yes, whatever tables belong in the keyspace my_test should have replicated to both DCs.
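One quick way to check that the schema itself has propagated is to verify schema agreement across the cluster with a standard nodetool command:
nodetool describecluster
If every node reports the same schema version, the keyspace and table definitions have reached both DCs.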
How are you determining that the tables have not replicated? I'd be happy to update my answer when you update your original question.
Since you're new to Cassandra, I recommend datastax.com/dev which has links to free hands-on tutorials where you can quickly learn the basics of Cassandra.
This tutorial is a good place to start -- datastax.com/try-it-out.
We also have FREE live workshops where you get to learn hands-on in a fun environment with other participants and have a chance to win prizes. Have a look at the list of upcoming workshops here -- datastax.com/workshops. Cheers!

Cassandra backup system keyspaces

I have a 3-node Cassandra cluster and a script which backs up all of the keyspaces, but when it comes to restoring on a fresh cluster, the data keyspaces are restored correctly while the system_* keyspaces are not.
So is it necessary to back up the system keyspaces in Cassandra?
You will need to back up the keyspace system_schema at the same time, as it contains the definitions of keyspaces, tables, and columns. The other system* keyspaces should be left untouched.
A fresh cluster builds its own setup, such as token ranges, based on its configuration. You can restore onto a new cluster, but you need to create the same schema and configuration as on the old cluster. There are many ways to handle the backup and restore procedure depending on your requirements; see below:
https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsBackupRestore.html
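As a minimal sketch, a snapshot that covers system_schema alongside a data keyspace could look like this (the tag and keyspace names are illustrative):
nodetool snapshot -t pre_restore system_schema my_keyspace
The snapshot directories then appear under each table's data directory and can be copied off the node.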

Restore snapshots from 3 node Cassandra cluster to a new 6 node cluster

I am new to Cassandra and would like some help with restoring snapshots from a 3-node Cassandra cluster to a new 6-node cluster.
We have a few keyspaces and would like to copy data from dev to production.
Thanks in advance.
The easiest way is to use the sstableloader tool that is bundled with Cassandra. You can find it in %installdir%/bin/sstableloader.
You will first need to re-create the schema on your new cluster:
dump the schema for the keyspace you want to transfer from your original cluster using cqlsh -e 'DESC KEYSPACE mykeyspace;' > mykeyspace.cql
load it into your new cluster using cqlsh -f mykeyspace.cql.
(optional) If your new cluster will have a different replication configuration, you'll need to modify it manually after loading the schema (ALTER KEYSPACE mykeyspace WITH REPLICATION = ...;).
Once that's done, you can start bulk-loading the SSTables from your keyspace snapshots into the new cluster:
sstableloader --nodes 10.0.0.1,10.0.0.2 -f /etc/cassandra/cassandra.yaml /path/to/mykeyspace/snapshot/
Note that this might take a while if you have a lot of data to load. You should also run a full repair on the new cluster afterwards to ensure that the replicas are properly distributed.
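For example, using the keyspace name from above:
nodetool repair --full mykeyspace
Run it on one node at a time, waiting for each repair to complete before moving on.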

Copying Cassandra data from one cluster to another

I have a cluster set up in Cassandra on AWS. Now I need to move my cluster to some other place. Since taking an image is not possible, I will create a new cluster that is an exact replica of the old one. Then I need to move the data from the old cluster to the new one. How can I do so?
My cluster has 2 data centers and each data center has 3 Cassandra machines with 1 seed machine.
Do you have connectivity between the old and new cluster? If yes, why not link the clusters and let Cassandra replicate the data to the new one? After the data is transferred, shut down the old cluster. Ideally you wouldn't even have any downtime.
You can take the SSTables from your data directory and then use sstableloader in the new data center to import the data.
Before doing this, you might consider running a compaction so that you have only one SSTable per table.
SSTable Loader
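As a sketch, a major compaction can be triggered per keyspace or table with nodetool (the names are illustrative):
nodetool compact mykeyspace mytable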
Using an SFTP server, or some other means, transfer the SSTables from the old cluster to the new cluster (one DC is enough) and use sstableloader. Replicating the data to the other DC will be taken care of by Cassandra.
In Cassandra there are two replication strategies, SimpleStrategy and NetworkTopologyStrategy. With NetworkTopologyStrategy you can control replication per data center. See this documentation: Data replication
You can use the COPY command to export and import CSV from one table to another.
Simple data importing and exporting with Cassandra
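A minimal COPY round trip from cqlsh might look like this (keyspace, table, and file names are illustrative):
COPY mykeyspace.mytable TO 'mytable.csv';
COPY mykeyspace.mytable FROM 'mytable.csv';
Note that COPY is only practical for small to moderate amounts of data; for large datasets, sstableloader is the better fit.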

Cassandra and Spark

Hi, I have a high-level question regarding cluster topology and data replication with respect to Cassandra and Spark being used together in DataStax Enterprise.
My understanding was that if there were 6 nodes in a cluster and heavy computing (e.g. analytics) is done, then you could have 3 Spark nodes and 3 Cassandra nodes if you want. Or you don't need three nodes for analytics, but your jobs would just not run as fast. The reason you don't want the heavy analytics on the Cassandra nodes is that the local memory is already being used up to handle the heavy read/write load of Cassandra.
This much is clear, but here are my questions :
How does the replicated data work then?
Are all the cassandra only nodes in one rack, and all the spark nodes in another rack?
Does all the data get replicated to the spark nodes?
How does that work if it does?
What are the recommended configuration steps to make sure the data is replicated properly to the Spark nodes?
How does the replicated data work then?
Regular Cassandra replication will operate between nodes and DCs. As far as replication goes, this is the same as having a C*-only cluster with two data centers.
Are all the cassandra only nodes in one rack, and all the spark nodes in another rack?
With the default DSE Snitch, your C* nodes will be in one DC and the Spark nodes in another DC. They will all be in a default rack. If you want to use multiple racks you will have to configure that yourself by using an advanced snitch. GPFS or PFS are good choices depending on your orchestration mechanisms. Learn more in the DataStax Documentation
Does all the data get replicated to the spark nodes? How does that work if it does?
Replication is controlled at the keyspace level and depends on your replication strategy:
SimpleStrategy will simply ask you for the number of replicas you want in your cluster (it is not data-center aware, so don't use it if you have multiple DCs):
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3 };
This assumes you only have one DC and that you'll have 3 copies of each bit of data.
NetworkTopologyStrategy lets you pick the number of replicas per DC:
CREATE KEYSPACE tst WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 3 };
You can choose to have a different number of replicas per DC.
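On Cassandra 3.0 and later you can also confirm the configured replication for each keyspace directly from the schema tables:
SELECT keyspace_name, replication FROM system_schema.keyspaces;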
What are the recommended configuration steps to make sure the data is replicated properly to the Spark nodes?
The procedure to update RF is in the datastax documentation. Here it is verbatim:
Updating the replication factor
Increasing the replication factor increases the total number of copies of keyspace data stored in a Cassandra cluster. If you are using security features, it is particularly important to increase the replication factor of the system_auth keyspace from the default (1) because you will not be able to log into the cluster if the node with the lone replica goes down. It is recommended to set the replication factor for the system_auth keyspace equal to the number of nodes in each data center.
Procedure
Update a keyspace in the cluster and change its replication strategy options:
ALTER KEYSPACE system_auth WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
Or if using SimpleStrategy:
ALTER KEYSPACE "Excalibur" WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
On each affected node, run the nodetool repair command. Wait until repair completes on a node, then move to the next node.
Know that increasing the RF in your cluster will generate lots of IO and CPU utilization as well as network traffic, while your data gets pushed around your cluster.
If you have a live production workload, you can throttle the impact by using nodetool getstreamthroughput / nodetool setstreamthroughput.
You can also throttle the resulting compactions with nodetool getcompactionthroughput / nodetool setcompactionthroughput.
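For example, to cap streaming at 100 Mb/s while the new replicas are populated (the value is illustrative; tune it for your hardware):
nodetool setstreamthroughput 100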
How do Cassandra and Spark work together on the analytics nodes and not fight for resources? If you are not going to limit Cassandra at all in the whole cluster, then what is the point of limiting Spark? Just have all the nodes Spark-enabled.
The key point is that you won't be pointing your main transactional reads/writes at the analytics DC (use a consistency level like LOCAL_ONE or LOCAL_QUORUM to point those requests to the C* DC). Don't worry, your data still arrives at the analytics DC by virtue of replication, but you won't wait for acks to come back from analytics nodes in order to respond to customer requests. The second DC is eventually consistent.
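In cqlsh, for example, the consistency level can be set per session:
CONSISTENCY LOCAL_QUORUM;
In application code you would set it on the session or statement through your driver.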
You are right in that cassandra and spark are still running on the same boxes in the analytics DC (this is critical for data locality) and have access to the same resources (and you can do things like control the max spark cores so that cassandra still has breathing room). But you achieve workload isolation by having two Data Centers.
DataStax drivers, by default, will consider the DC of the first contact point they connect with as the local DC, so just make sure that your contact point list only includes machines in the local (C*) DC.
You can also specify the local datacenter yourself depending on the driver. Here's an example for the ruby driver, check the driver documentation for other languages.
use the :datacenter cluster method: First datacenter found will be assumed current by default. Note that you can skip this option if you specify only hosts from the local datacenter in the :hosts option.
You are correct, you want to separate your cassandra and your analytics workload.
A typical setup could be:
3 Nodes in one datacenter (name: cassandra)
3 Nodes in second datacenter (name: analytics)
When creating your keyspaces you define them with a NetworkTopologyStrategy and a replication factor defined for each datacenter, like so:
CREATE KEYSPACE myKeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'cassandra': 2, 'analytics': 2};
With this setup, your data will be replicated twice in each datacenter. This is done automatically by Cassandra. So when you insert data in DC cassandra, the inserted data will get replicated to DC analytics automatically, and vice versa. Note: you can define what data is replicated by using separate keyspaces for the data you want analyzed and the data you don't.
In your cassandra.yaml you should use GossipingPropertyFileSnitch. With this snitch you can define the DC and the rack of your node in the file cassandra-rackdc.properties. This information then gets propagated via the gossip protocol, so each node learns the topology of your cluster.
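For example, a node in the analytics datacenter from the setup above might have a cassandra-rackdc.properties like this (the rack name is illustrative):
dc=analytics
rack=rack1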
