Cassandra Cluster Replication - Existing Nodes & Existing Data - Azure

We have a requirement to replicate a Cassandra cluster that already has nodes and data in it. Approximately 2.5 TB of data is on Azure and 3.5 TB on AWS. We need to pull the remaining data from AWS to Azure. Your kind help is appreciated.

There are many options here.
You can connect the two using GossipingPropertyFileSnitch (GPFS): stand up a DC in Azure, replicate across, then remove the old DC.
You could unload the data via the Cassandra loader. https://github.com/brianmhess/cassandra-loader
You could take a snapshot and then stream the data to the new cluster via sstableloader.
It's hard to give a complete answer - it would depend on so many factors. The above should get you started at least.
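As a rough sketch of the snapshot-plus-sstableloader option above, the helper below only assembles the two command lines; it does not run anything. The keyspace name, snapshot tag, staging path, and host addresses are hypothetical placeholders, and the sketch assumes the snapshot files have already been copied into a directory whose last two path components are the keyspace and table names, which is what sstableloader expects.

```python
import shlex


def snapshot_and_load_commands(keyspace, tag, sstable_dir, target_hosts):
    """Return the shell commands for a snapshot followed by a bulk load.

    All argument values are supplied by the caller; nothing here is
    specific to any particular cluster.
    """
    # Run on each source node: takes a tagged snapshot of the keyspace.
    snapshot = ["nodetool", "snapshot", "-t", tag, keyspace]
    # Run once the snapshot files are staged under <keyspace>/<table>;
    # -d lists initial contact points in the destination cluster.
    load = ["sstableloader", "-d", ",".join(target_hosts), sstable_dir]
    return [shlex.join(snapshot), shlex.join(load)]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for cmd in snapshot_and_load_commands(
        "shop", "migr1", "/staging/shop/orders", ["10.0.0.1", "10.0.0.2"]
    ):
        print(cmd)
```

The load step has to be repeated for each table directory, since sstableloader streams one table's SSTables per invocation.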

Related

Does Scylla DB have a similar migration support to GKE as K8ssandra's Zero Downtime Migration feature?

We are trying to migrate our ScyllaDB cluster deployed on GCE machines to a GKE cluster in Google Cloud. We came across an approach for Cassandra migration and want to implement the same for our ScyllaDB migration. The link is below. Can you please suggest whether this is possible in Scylla, or whether Scylla has not yet introduced such a migration technique with the Scylla K8S operator?
https://k8ssandra.io/blog/tutorials/cassandra-database-migration-to-kubernetes-zero-downtime/
Adding a new "destination" DC to your existing cluster's "source" DC is a very common technique for migrating to a new DC.
Add the new "destination" DC
Change replication factor settings accordingly
Run nodetool rebuild to stream data from the "source" DC to the "destination" DC
Run nodetool repair on the new DC
Update your application clients to connect to the new DC once it's ready to serve (all data streamed + repaired)
Decommission the "old" (source) DC
For the gory details see here:
https://docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/add-dc-to-existing-dc.html
https://docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/decommissioning-data-center.html
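The replication-factor change and the rebuild step from the list above might look roughly like this. It is a minimal sketch that only builds the statement and command strings; the keyspace name and the DC names (`gce_dc`, `gke_dc`) are hypothetical, and the linked procedures remain the reference for the full sequence, including system keyspaces.

```python
def alter_keyspace_cql(keyspace, replicas_per_dc):
    """Build an ALTER KEYSPACE statement that uses NetworkTopologyStrategy.

    replicas_per_dc maps a data center name to its replication factor.
    """
    options = ", ".join(
        "'{}': {}".format(dc, rf) for dc, rf in replicas_per_dc.items()
    )
    return (
        "ALTER KEYSPACE " + keyspace
        + " WITH replication = {'class': 'NetworkTopologyStrategy', "
        + options + "};"
    )


def rebuild_command(source_dc):
    """Command to run on every node of the new DC to stream from source_dc."""
    return "nodetool rebuild -- " + source_dc


if __name__ == "__main__":
    # Hypothetical names: 'gce_dc' is the source DC, 'gke_dc' the destination.
    print(alter_keyspace_cql("shop", {"gce_dc": 3, "gke_dc": 3}))
    print(rebuild_command("gce_dc"))
```

Once clients have moved to the new DC, the same helper can produce the statement that drops the old DC from replication by passing only the destination DC in the mapping.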
If you prefer to go the full-scan route (CQL reads on the source and CQL writes on the destination, with some ability for data manipulation and save points to resume from), then the Scylla Spark Migrator is a good option.
https://github.com/scylladb/scylla-code-samples/tree/master/spark-scylla-migrator-demo
You can also use the Scylla Spark migrator to migrate parquet files
https://www.scylladb.com/2020/06/10/migrate-parquet-files-with-the-scylla-migrator/
Remember not to migrate materialized views (MVs); you can always re-create them from the base tables after the migration.
We use an Apache Spark-based Migrator: https://github.com/scylladb/scylla-migrator
Here's the blog we wrote on how to do this back in 2019: https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-migrator/
Though in this case, you aren't moving from Cassandra to ScyllaDB; just moving from one ScyllaDB instance to another. If this makes sense to you, it should be straightforward. If you have questions, feel free to join our Slack community to get more interactive assistance:
http://slack.scylladb.com/

Cassandra cluster bulk loader hangs during export

I am migrating a simple 4-node Cassandra cluster from one cloud provider to another. The number of nodes in both clouds is the same; however, the newer cluster is at version 3.11.0 and the older one is at 3.0.11. I am using sstableloader to stream data from one cluster to the other (the schema has been created on the new cluster separately). As per the release notes this should not be a problem.
However, for certain column families sstableloader progresses to 100% but then hangs there for hours (hang time >> time to stream). The total data to stream on each node is below 500 GB. Any help on why this is happening and how to avoid it is appreciated.
Create a new node on the new cloud server and add it to the existing cluster
Flush tables from the memtable to SSTables on disk
Delete one node from the old cloud server. Likewise, repeat for each node.

Copying Cassandra data from one cluster to another

I have a cluster set up in Cassandra on AWS. Now, I need to move my cluster to some other place. Since taking an image is not possible, I will create a new cluster that is an exact replica of the old one. Now I need to move the data from this cluster to the other. How can I do so?
My cluster has 2 data centers and each data center has 3 Cassandra machines with 1 seed machine.
Do you have connectivity between the old and new cluster? If yes, why not link the clusters and let Cassandra replicate the data to the new cluster? After the data is transferred, shut down the old cluster. Ideally you wouldn't even have any downtime.
You can take your SSTables from your data directory and then use sstableloader in the new data center to import the data.
Before doing this, you might consider running a compaction so that you have only one SSTable per table.
SSTable Loader
Using an SFTP server or some other means, transfer the SSTables from the old cluster to the new cluster (one DC is enough) and use sstableloader. Cassandra will take care of replicating the data to the other DC.
Cassandra has two replication strategies, SimpleStrategy and NetworkTopologyStrategy. With NetworkTopologyStrategy you can replicate across data centers. See the documentation: Data replication
You can use the cqlsh COPY command to export a table to CSV and import it into another table
Simple data importing and exporting with Cassandra
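For the COPY route, the pair of cqlsh statements could be generated along these lines. This is a sketch only: the keyspace, table, and file path are hypothetical, COPY is a cqlsh command rather than a CQL statement, and it is generally suited to small tables rather than multi-terabyte migrations.

```python
def copy_statements(keyspace, table, csv_path):
    """Return (export, import) cqlsh COPY statements for one table.

    The export is meant to run in cqlsh against the old cluster and the
    import against the new one, after the schema exists there.
    """
    target = "{}.{}".format(keyspace, table)
    export_stmt = "COPY {} TO '{}' WITH HEADER = TRUE;".format(target, csv_path)
    import_stmt = "COPY {} FROM '{}' WITH HEADER = TRUE;".format(target, csv_path)
    return export_stmt, import_stmt


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for stmt in copy_statements("shop", "orders", "/tmp/orders.csv"):
        print(stmt)
```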

3 nodes cassandra with one being a spark master - to solve geospatial data or geographic data

I am looking for directions:
I have a Cassandra database with latitude and longitude data. I need to search for data within a radius or a box of coordinates around a point. I am using the golang (gocql) client to query Cassandra.
I need some understanding regarding Spark and Cassandra, as this seems like the way to go.
Are the following assumptions correct? I have 2 Cassandra nodes (with the data in 2 replicas).
Should I then install an extra node and install Spark on this and then connect it to the other two existing Cassandra nodes containing the data(With the Spark Connector from DataStax).
And do the two existing Cassandra nodes need to have Spark workers installed on them to work with Spark Master node?
When the Spark setup is in place, do you query (in Scala) the existing data, save the data onto the Spark node, and then query this with the golang (gocql) client?
Any direction is welcome.
Thanks in advance
Geospatial Searching is a pretty deep topic. If it's just doing searches that you're after (not batch/analytics), I can tell you that you probably don't want to use Spark. Spark isn't very good at 'searching' for data - even when it's geospatial. The main reason is that Spark doesn't index data for efficient searches and you'd have to create a job/context (unless using job server) every time you'd want to do a search. That takes forever when you're thinking in terms of user facing application time.
Solr, Elasticsearch, and DataStax Enterprise Search (disclaimer: I work for DataStax) are all capable of box and radius searches on Cassandra data and do so in near real time.
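To make "box and radius search" concrete, here is a brute-force sketch in plain Python. It is not how any of the search engines above implement it (they use indexes); it only shows the arithmetic: a bounding box as a cheap pre-filter, followed by an exact great-circle distance check. The Earth radius constant is an approximation, and the box calculation assumes the circle does not reach a pole or cross the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(lat, lon, radius_km):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing the circle."""
    r = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(r)
    dlon = math.degrees(math.asin(math.sin(r) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def within_radius(points, lat, lon, radius_km):
    """Filter (lat, lon) tuples: box pre-filter first, then exact distance."""
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_km)
    return [
        (plat, plon)
        for plat, plon in points
        if min_lat <= plat <= max_lat
        and min_lon <= plon <= max_lon
        and haversine_km(lat, lon, plat, plon) <= radius_km
    ]


if __name__ == "__main__":
    candidates = [(0.0, 0.5), (0.0, 2.0)]
    print(within_radius(candidates, 0.0, 0.0, 100.0))
```

Scanning every row like this does not scale, which is why an indexed search engine is the practical choice for user-facing queries.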
To answer your original question though, if the bulk of your analytics in general comes from Cassandra data, it may be a good idea to run Spark on the same nodes as Cassandra for data locality. The nice thing is that Spark scales quite nicely, so if you find Spark taking too many resources from Cassandra, you can simply scale out (both Cassandra and Spark).
Should I then install an extra node and install Spark on this and then
connect it to the other two existing Cassandra nodes containing the
data(With the Spark Connector from DataStax).
Spark is a cluster compute engine so it needs a cluster of nodes to work well. You'll need to install it on all nodes if you want it to be as efficient as possible.
And do the two existing Cassandra nodes need to have Spark workers
installed on them to work with Spark Master node?
I don't think they 'have' to have them, but it's a good idea for locality. There's a really good video on academy.datastax.com that shows how the spark cassandra connector reads data from Cassandra to Spark. I think it will clear a lot of things up for you: https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data
When the Spark setup is in place, do you query(Scala) the existing
data and then save the data onto the Spark node and then query this
with the golang (gocql) client?
The Spark-Cassandra connector can communicate with both Cassandra and Spark. There are methods, saveToCassandra() for example, that will write data back to Cassandra once your jobs are processed. Then you can use your client as you normally would.
There are some really good free Spark + Cassandra tutorials at academy.datastax.com. This is also a good place to start: http://rustyrazorblade.com/2015/01/introduction-to-spark-cassandra/

Creating new datacenter with Datastax OpsCenter

I'd like to enable vnodes on my Cassandra cluster, which has an Analytics DC and a regular Cassandra DC. I am using OpsCenter 5.0.1 and DSE 4.5. My question is: how can I create a new DC with OpsCenter, with vnodes enabled, so I can transfer my data over from my existing DCs? I am following the instructions on this page, but surely I don't have to manually edit the config file on every node to enable a new datacenter, right? Any help much appreciated.
Unfortunately OpsCenter's automated provisioning doesn't currently support creating multi-dc clusters or adding data centers to existing clusters. We know this is important functionality that's missing, and are working on making that available as soon as we can.

Resources