Scylla datacenter and Cassandra datacenter in same cluster - cassandra

I have a running 21-node Cassandra cluster with 150+ schemas and about 20 TB of data. I need to move the schema and data from Cassandra to a 7-node Scylla cluster with no downtime.
Both Scylla and Cassandra support the same cqlsh version and are almost identical in how they distribute data and gossip.
To move the data, I am trying to create a new Scylla datacenter in the existing Cassandra cluster, update the keyspace topology so the Scylla DC is included in the replication settings, and then bootstrap/rebuild the Scylla nodes into the cluster.
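The keyspace change I am attempting looks roughly like this (the keyspace, DC names, and replication factors are placeholders for my real ones):

    # Add the Scylla DC to each keyspace's replication settings, then
    # rebuild the Scylla nodes so they stream the existing data.
    cqlsh <cassandra-node> -e "ALTER KEYSPACE my_ks WITH replication = {
        'class': 'NetworkTopologyStrategy', 'dc_cassandra': 3, 'dc_scylla': 3};"
    nodetool rebuild -- dc_cassandra    # on each new Scylla node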
However, when I add the seed list to the node, I get TCP connection failure errors.
Scylla errors:
scylla: [shard 0] rpc - client 10.200.1.2:34236: server connection dropped: connection is closed
scylla: [shard 0] rpc - client 10.200.1.2:7000: fail to connect: Connection refused.
Cassandra errors:
[MessagingService-Outgoing-/10.200.2.2-Gossip] OutboundTcpConnection.java:411 - Socket to /10.200.2.2 closed
[HANDSHAKE-/10.200.2.2] OutboundTcpConnection.java:570 - Cannot handshake version with /10.200.2.2
[HANDSHAKE-/10.200.2.2] OutboundTcpConnection.java:561 - Handshaking version with /10.200.2.2
Please help if anyone has done this already, or suggest a better, lower-risk way of moving the data without downtime or data loss.

You cannot have a heterogeneous cluster with C* and Scylla nodes in the same cluster.
Create a separate Scylla cluster, create the schema, change the application to do double writes (to both clusters), and then migrate the historical C* data to Scylla.
There are multiple ways to migrate the data; this should help: https://youtu.be/CDOesdWDT9Y. No downtime, no problem: there are options for that too.
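As a toy sketch of the dual-write idea (the hosts, keyspace, and table are placeholders, and a real application would dual-write from its driver code rather than a shell script):

    #!/usr/bin/env bash
    # Write each mutation to both clusters until the migration is complete.
    CASSANDRA_HOST=10.0.0.1    # placeholder: node in the existing C* cluster
    SCYLLA_HOST=10.0.1.1       # placeholder: node in the new Scylla cluster
    ID=$(uuidgen)              # generate the key once so both writes match
    STMT="INSERT INTO my_ks.my_table (id, payload) VALUES ($ID, 'example');"
    cqlsh "$CASSANDRA_HOST" -e "$STMT"
    cqlsh "$SCYLLA_HOST" -e "$STMT"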

While Scylla is compatible with Cassandra across several axes (SSTables, CQL/drivers, etc.), Scylla did need to make some changes to the gossip protocol, and those changes make it impossible for a Scylla node to join a Cassandra cluster. There is no known way to join Scylla to a Cassandra cluster.
Scylla has published several suggested techniques for migration.
Blog describing the techniques: https://www.scylladb.com/2019/04/02/spark-file-transfer-and-more-strategies-for-migrating-data-to-and-from-a-cassandra-or-scylla-cluster/
Webinar walking through the migration techniques [requires registration]: https://go.scylladb.com/wbn-spark-scylla-migration-strategies-registration.html
Documentation: https://docs.scylladb.com/operating-scylla/procedures/cassandra_to_scylla_migration_process/
Community Slack for Q&A: http://slack.scylladb.com

Related

Is it possible to backup a 6-node DataStax Enterprise cluster and restore it to a new 4-node cluster?

I have this case: we have a 6-node DSE cluster, and the task is to back it up and restore all the keyspaces, tables, and data into a new cluster. But this new cluster has only 4 nodes.
Is it possible to do this?
Yes, it is definitely possible to do this. This operation is more commonly referred to as "cloning" -- you are copying the data from one DataStax Enterprise (DSE) cluster to another.
There is a Cassandra utility called sstableloader which reads SSTables and loads them into a cluster, even when the destination cluster's topology is not identical to the source's.
I have previously documented the procedure in How to migrate data in tables to a new Cassandra cluster which is also applicable to DSE clusters. Cheers!
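A minimal sketch of that step (hosts and paths are placeholders; the directory layout must end in keyspace/table, e.g. from a copied snapshot):

    # Run once per table; sstableloader streams the SSTables into the
    # destination cluster according to that cluster's own topology.
    sstableloader -d 10.0.0.1,10.0.0.2 /path/to/backup/my_ks/my_table/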

Multi DC replication between different Cassandra versions

We have an existing Cassandra cluster (3.0.9) running on production.
Now we want to create data pipelines to ingest data from Cassandra and persist it in Hadoop. We are thinking of using the CDC feature (available from Cassandra 3.8) along with Kafka Connect.
We are thinking of creating a new read-only DC which will replicate data from the production DC. This new DC will run the latest Cassandra version (3.8+) with CDC enabled.
My questions:
For replication to work, do both DCs need to run the same version of Cassandra? Can't we achieve this without upgrading the DC used by the service?
Is it possible to enable CDC feature only in the new read-only DC?
UPDATE:
More information from C* mailing list https://lists.apache.org/thread.html/r9e705895c480f264998c29cf69c0eb2296382049467e31c447f676c7%40%3Cuser.cassandra.apache.org%3E
I think it should be the same version as the existing DC for the data to be replicated by adding a DC. You may refer to the recommended document below for adding a new datacenter to an existing cluster.
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/operations/opsAddDCToCluster.html
You should upgrade the existing DC to the newer version of Cassandra to get the feature you expect.
You can make the new DC effectively read-only by not sending any direct traffic to it; all client connections should go to the older DC.
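On the CDC question, note that cdc_enabled in cassandra.yaml is a per-node setting, while the table-level cdc flag is cluster-wide schema; a rough sketch (keyspace and table names are placeholders):

    # cassandra.yaml, on the nodes of the new read-only DC only:
    #   cdc_enabled: true
    #   cdc_raw_directory: /var/lib/cassandra/cdc_raw
    # Then flag the tables to capture; the schema change is global, but
    # commit log segments are only retained where cdc_enabled is true:
    cqlsh <new-dc-node> -e "ALTER TABLE my_ks.my_table WITH cdc = true;"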

Migrate Datastax Enterprise Cassandra to Apache Cassandra

We are currently using DSE 4.8 and 5.12, and we want to migrate to Apache Cassandra. Since we don't use Spark or Search, we thought we could save some bucks by moving to Apache. Can this be achieved without downtime? I see sstableloader works the other way around. Can anyone share the steps to follow to migrate from DSE to Apache Cassandra, something like this but from DSE to Apache:
https://support.datastax.com/hc/en-us/articles/204226209-Clarification-for-the-use-of-SSTABLELOADER
Figure out what version of Apache Cassandra is being run by DSE. Based on the DSE documentation, DSE 4.8.14 uses Apache Cassandra 2.1 and DSE 5.1 uses Apache Cassandra 3.11.
The simplest way to do this is to build another DC (a logical DC, as far as Cassandra is concerned) and add it to the existing cluster.
As usual, run a "nodetool rebuild {from-old-DC}" on the new DC's nodes and let Cassandra take care of streaming the data to the new Apache Cassandra nodes naturally.
Once data streaming is complete, switch the applications' local_dc (per the LoadBalancingPolicy they use) to DC2 (the new DC). Once the new DC starts taking traffic, shut down the nodes in the old DC, say DC1, one by one.
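A rough sketch of that sequence (keyspace, DC names, and replication factors are placeholders):

    # 1. Include the new DC in each application keyspace's replication:
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {
        'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};"
    # 2. On each node in the new DC, stream the existing data from DC1:
    nodetool rebuild -- DC1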
Alter the dse_system and dse_security keyspaces so they no longer use EverywhereStrategy, which Apache Cassandra does not support.
On non-seed nodes, clean up the Cassandra data directory.
Turn on the replace option in cassandra-env.sh (see the sketch after these steps).
Start the instance.
Monitor the streaming progress using the command 'nodetool netstats | grep Receiving'.
Change the seed node definitions and do a rolling restart before finally migrating the previous seed nodes.
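A sketch of the replace step (the address is a placeholder for the old node being replaced):

    # In cassandra-env.sh, before starting the instance:
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.5"
    # After startup, watch the streaming progress:
    nodetool netstats | grep Receiving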

DSE 5 and DSE 4.8.9 in Same Cluster

Is it at all possible to have two different DSE versions in the same cluster? In my case, I have a cluster of two DSE 5 nodes and another of two DSE 4.8.9 nodes. Can I connect them such that data is replicated from DSE 4.8.9 to DSE 5 in real time?
No. If you were to try this, you'd be in an "Upgrade State." And clusters in an upgrade state are bound by these restrictions:
Do not enable new features.
Do not run nodetool repair.
Do not issue these types of CQL queries during a rolling restart: DDL and TRUNCATE.
During upgrades, the nodes on different versions might show a schema disagreement.
Failure to upgrade SSTables when required results in a significant performance impact and increased disk usage. Upgrading is not complete until the SSTables are upgraded.
Trying something like this would be further exacerbated by the fact that 4.8.9 is based on Cassandra 2.1 and 5.0 is based on Cassandra 3.0. There were some significant changes between the two, so you would undoubtedly run into problems.
The best way to go about this would be to upgrade your 4.8.9 nodes to 5.0 first, and then add your new 5.0 cluster nodes afterward.
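For the SSTable-upgrade requirement specifically, a minimal sketch (run on each node after its binaries have been upgraded):

    # Rewrite any SSTables still in an older on-disk format to the
    # current version; the upgrade is not complete until this has
    # finished on every node.
    nodetool upgradesstables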

Spark with replicated Cassandra nodes

I found an article where the author advises the following Spark-Cassandra architecture: a Spark slave for each Cassandra node.
I have N Cassandra nodes, and all nodes are complete replicas of each other. Does it make sense to run a Spark slave for each Cassandra node in my case?
Yes, it does. The Spark-Cassandra connector is data-locality aware, i.e. each Spark node co-located with a Cassandra node will make sure to only process the local Cassandra data, which avoids shuffling lots of data across the network. You can find out how this works by watching Russell Spitzer's talk on this topic.
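As a rough sketch of the setup (the connector version, host, and job file are placeholders; the connector derives its locality preferences from the cluster's token ranges):

    # Submit a job with the Spark-Cassandra connector on a cluster whose
    # Spark workers are co-located with the Cassandra nodes; partitions
    # are preferentially scheduled on the node that owns their token range.
    spark-submit \
        --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
        --conf spark.cassandra.connection.host=10.0.0.1 \
        my_job.py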
