Hub-Spoke model with Cassandra

I'm trying to create a hub-spoke topology with Cassandra. I want to have one centralised C* server and many spoke C* servers. Whenever a new record arrives at any of the spokes, it should be moved to the hub C* server. I tried with replication strategies, but those seem to be bi-directional: if I insert a record on node1, I can see the record on all the nodes in my cluster. Any suggestions/guidance would be highly appreciated.

This is a feature introduced in DataStax Enterprise 5.0. You can find all the details in the docs, but, in short, DSE Advanced Replication provides unidirectional replication from remote clusters to a central hub, and it also supports prioritization of data streams.

Related

Does ScyllaDB have migration support to GKE similar to K8ssandra's Zero Downtime Migration feature?

We are trying to migrate our ScyllaDB cluster deployed on GCE machines to a GKE cluster in Google Cloud. We came across one approach for Cassandra migration and want to implement the same for the ScyllaDB migration. Below is the link; can you please suggest whether this is possible in Scylla, or whether Scylla hasn't introduced such a migration technique with the Scylla K8s operator?
https://k8ssandra.io/blog/tutorials/cassandra-database-migration-to-kubernetes-zero-downtime/
Adding a new "destination" DC alongside your existing "source" DC in the same cluster is a very common technique for migrating to a new DC:
1. Add the new "destination" DC.
2. Change the replication factor settings accordingly (see the sketch after the links below).
3. nodetool rebuild --> stream data from the "source" DC to the "destination" DC.
4. nodetool repair the new DC.
5. Update your application clients to connect to the new DC once it's ready to serve (all data streamed + repaired).
6. Decommission the "old" (source) DC.
For the gory details see here:
https://docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/add-dc-to-existing-dc.html
https://docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/decommissioning-data-center.html
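As a minimal sketch of the replication-factor change in step 2 (the keyspace name, DC names, and contact point below are placeholders, not from the original posts): the ALTER KEYSPACE can be run from cqlsh or, as here, through the DataStax Java driver, which ScyllaDB also speaks; nodetool rebuild and nodetool repair are then run on the nodes themselves.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ExtendReplicationToNewDc {
    public static void main(String[] args) {
        // Placeholder contact point; any node of the existing "source" DC will do.
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect()) {
            // NetworkTopologyStrategy takes a replication factor per DC, so the
            // keyspace now covers both the old and the new datacenter.
            session.execute(
                "ALTER KEYSPACE my_keyspace WITH replication = "
              + "{'class': 'NetworkTopologyStrategy', 'source_dc': 3, 'destination_dc': 3}");
            // Next: run `nodetool rebuild -- source_dc` on every node of the new DC
            // to stream the existing data, then `nodetool repair` the new DC.
        }
    }
}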
If you prefer to go the full-scan route (CQL reads on the source and CQL writes on the destination, with some ability for data manipulation and save points to resume from), then the Scylla Spark Migrator is a good option:
https://github.com/scylladb/scylla-code-samples/tree/master/spark-scylla-migrator-demo
You can also use the Scylla Spark Migrator to migrate Parquet files:
https://www.scylladb.com/2020/06/10/migrate-parquet-files-with-the-scylla-migrator/
Remember not to migrate materialized views (MVs); you can always re-create them from the base tables after the migration.
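If you do need to re-create a view on the destination afterwards, here is a hedged sketch (keyspace, base table, view, and column names are all hypothetical):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class RecreateViewAfterMigration {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect()) {
            // The view is rebuilt from the already-migrated base table, so it does
            // not need to be copied during the migration itself.
            session.execute(
                "CREATE MATERIALIZED VIEW my_ks.users_by_email AS "
              + "SELECT * FROM my_ks.users "
              + "WHERE email IS NOT NULL AND user_id IS NOT NULL "
              + "PRIMARY KEY (email, user_id)");
        }
    }
}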
We use an Apache Spark-based Migrator: https://github.com/scylladb/scylla-migrator
Here's the blog we wrote on how to do this back in 2019: https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-migrator/
Though in this case, you aren't moving from Cassandra to ScyllaDB, just from one ScyllaDB instance to another. If this makes sense to you, it should be straightforward. If you have questions, feel free to join our Slack community to get more interactive assistance:
http://slack.scylladb.com/

Multi DC replication between different Cassandra versions

We have an existing Cassandra cluster (3.0.9) running in production.
Now we want to create data pipelines to ingest data from Cassandra and persist it in Hadoop. We are thinking of using the CDC feature (available from Cassandra 3.8) along with Kafka Connect.
We are thinking of creating a new read-only DC which will replicate data from the production DC. This new DC will be running the latest Cassandra version (3.8+) with CDC enabled.
My questions:
For replication to work, do we need both DCs running the same version of Cassandra? Can't we achieve this without upgrading the DC used by the service?
Is it possible to enable CDC feature only in the new read-only DC?
UPDATE :
More information from C* mailing list https://lists.apache.org/thread.html/r9e705895c480f264998c29cf69c0eb2296382049467e31c447f676c7%40%3Cuser.cassandra.apache.org%3E
I think it should be the same version as the existing DC when adding a DC for data replication. You may refer to the recommended document below for adding a new datacenter to an existing cluster.
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/operations/opsAddDCToCluster.html
You should upgrade the existing DC from the lower to the higher Cassandra version to get the expected feature.
You can make the new DC effectively read-only by not sending it any direct traffic; all client connections should go to the older DC.
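On the CDC part of the question: CDC is switched on per node in cassandra.yaml (cdc_enabled: true) and per table via the cdc table option introduced in 3.8. The table option lives in the schema, which is shared by all DCs, so it is the node-level cdc_enabled flag that you would set only on the new DC's nodes. Below is a minimal sketch with the Java driver; the contact point, keyspace, and table names are placeholders.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class EnableCdcOnTable {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect()) {
            // Requires Cassandra 3.8+. Commit log segments containing writes to this
            // table are preserved in the cdc_raw directory on nodes with cdc_enabled: true.
            session.execute("ALTER TABLE my_ks.events WITH cdc = true");
        }
    }
}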

Creating new datacenter with Datastax OpsCenter

I'd like to enable vnodes on my Cassandra cluster, which has an Analytics DC and a regular Cassandra DC. I am using OpsCenter 5.0.1 and DSE 4.5. My question is: how can I create a new DC with OpsCenter, with vnodes enabled, so I can transfer my data over from my existing DCs? I am following the instructions on this page, but surely I don't have to manually edit the config file on every node to enable a new datacenter, right? Any help much appreciated.
Unfortunately OpsCenter's automated provisioning doesn't currently support creating multi-dc clusters or adding data centers to existing clusters. We know this is important functionality that's missing, and are working on making that available as soon as we can.

Ability to write to a particular Cassandra node

Is there a possibility to write to a particular node using the DataStax driver?
For example, I have three nodes in datacenter 1 and three nodes in datacenter 2.
Existing
If I build up the cluster with any one of them as a seed, all the nodes will get detected by the DataStax Java driver. So, in this case, if I insert data using the driver, it will automatically choose one of the nodes (preferably in the local datacenter) and proceed with it as the coordinator.
Requirement
I want a way to contact any node in datacenter 2 and hand over the coordinator job to one of the nodes in datacenter 2.
Why I need this
I am trying to use the trigger functionality from datacenter 2 alone. Since triggers are handled by the coordinator, I want the coordinator to be selected from datacenter 2 so that datacenter 1 doesn't have to do this operation.
You may be able to use the DCAwareRoundRobinPolicy load balancing policy to achieve this by creating the policy such that DC2 is considered the "local" DC.
Cluster.Builder builder = Cluster.builder().withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("dc2"));
In the above example, remote (non-DC2) nodes will be ignored.
There is also a new WhiteListPolicy in driver version 2.0.2 that wraps another load balancing policy and restricts the nodes to a specific list you provide.
Cluster.Builder builder = Cluster.builder().withLoadBalancingPolicy(new WhiteListPolicy(new DCAwareRoundRobinPolicy("dc2"), whiteList));
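Putting the two snippets together, here is a minimal, self-contained sketch (the DC2 node addresses and contact point are hypothetical) that restricts coordinators to DC2 nodes, which is what makes the triggers fire only there:

import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.List;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.WhiteListPolicy;

public class Dc2OnlyClient {
    public static void main(String[] args) {
        // Hypothetical addresses of the DC2 nodes; replace with your own.
        List<InetSocketAddress> whiteList = Arrays.asList(
                new InetSocketAddress("10.0.2.1", 9042),
                new InetSocketAddress("10.0.2.2", 9042),
                new InetSocketAddress("10.0.2.3", 9042));
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.2.1")
                // Only whitelisted DC2 nodes are eligible; the wrapped policy treats "dc2" as local.
                .withLoadBalancingPolicy(
                        new WhiteListPolicy(new DCAwareRoundRobinPolicy("dc2"), whiteList))
                .build();
        try {
            // Every statement executed through this session is coordinated by a DC2 node.
            cluster.connect().execute("SELECT release_version FROM system.local");
        } finally {
            cluster.close();
        }
    }
}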
For multi-DC scenarios Cassandra provides EACH and LOCAL consistency levels, where EACH acknowledges a successful operation in every DC and LOCAL only in the local one.
If I understood correctly, what you are trying to achieve is DC failover in your application. This is not a good practice. Let's assume your application is hosted in DC1 alongside Cassandra. If DC1 goes down, your entire application is unavailable. If DC2 goes down, your application can still write with a LOCAL CL, and C* will replicate the changes when DC2 is back.
If you want to achieve HA, you need to deploy the application in each DC, use CL=LOCAL_X, and finally do failover at the DNS level (e.g. using AWS Route53).
See the data consistency docs and this blog post for more info about consistency levels with multiple DCs.
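As an illustration of the LOCAL_*/EACH_* levels (a hedged sketch; the contact point, keyspace, and statement are placeholders), the Java driver lets you set the consistency level as a cluster-wide default or per statement:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LocalQuorumExample {
    public static void main(String[] args) {
        // Default every statement to LOCAL_QUORUM: only the local DC's replicas must
        // acknowledge, so the application keeps working while the remote DC is down.
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.1.1")
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                .build();
        try (Session session = cluster.connect()) {
            // Or override per statement, e.g. EACH_QUORUM to require an ack from every DC.
            SimpleStatement stmt = new SimpleStatement(
                    "INSERT INTO my_ks.events (id, payload) VALUES (uuid(), 'x')");
            stmt.setConsistencyLevel(ConsistencyLevel.EACH_QUORUM);
            session.execute(stmt);
        } finally {
            cluster.close();
        }
    }
}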

GridGain open source datacenter topology specification

GRIDGAIN DATA-CENTER REPLICATION
A few specific questions regarding the recently open-sourced GridGain code. The gridgain.org support link says datacenter replication is not enabled for the open-source version. Is this true or false?
More importantly, assuming the open-source version has the datacenter feature enabled, how do we go about specifying the topology and activating the replication?
For example, the official documentation suggests creating/setting a GridDrSenderCacheConfiguration and GridDrSenderHubConfiguration with details of the topology. I did this but it didn't seem to enable any cross-data-center replication.
More specifically, I did the following:
assign a dataCenterId byte parameter in the config.xml for gridgain.
...
define those nodes that are part of that datacenter under the
... add ip addresses of nodes
Define the above for each node in each datacenter appropriately. In the GridGain Java client code, instantiate a GridGain instance and set the GridDrSenderCacheConfiguration and GridDrSenderHubConnection (along with the GridDrSenderHubConnectionConfiguration) as specified in the docs for each node in each datacenter, also using a dummy GridDrReceiverHubConfiguration object (all defaults).
However this does not seem to do any replication across the data centers.
Would someone from the GridGain team please give some examples of setting up the data center replication, How to setup the config.xml, and enable in the java code when instantiating a gridgain instance.
Also, I am trying to avoid intra-datacenter replication by setting the gridDrSenderHubConnectionConfiguration.setIgnoredDataCenterIds(localDC); parameter, so that data is not replicated back to the local datacenter.
Just confirmed: since data center replication is not present in the open source version, no replication would happen in this case. Please download the eval version of GridGain Enterprise Edition and try it out.
