I have a multi-datacenter environment setup in cassandra. I have triggers setup, but the triggers only run on the node that the data was changed on, in only one DC. Cassandra does a good job of data replication though all data centers. But is there a way to setup cassandra triggers so that they run in each data center?
According to datastax cql_reference, Place the custom trigger code (JAR) in the triggers directory on every node.
https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlCreateTrigger.html
Related
We are trying to migrate our ScyllaDB cluster deployed on GCE machines to the GKE cluster in Google Cloud, we came across one approach of Cassandra migration and want to implement the same here in ScyllaDB migration. Below is the link for the same, can you please suggest if this is possible in Scylla ?
or if Scylla hasn't introduced such a migration technique with the Scylla K8S operator ?
https://k8ssandra.io/blog/tutorials/cassandra-database-migration-to-kubernetes-zero-downtime/
Adding a new "destination" DC to your existing cluster "source" DC, is a very common technic to migrate to a new DC.
Add the new "destination" DC
Change replication factor settings accordingly
nodetool rebuild --> stream data from the "source" DC to the "destination" DC
nodetool repair the new DC.
Update your application clients to connect to the new DC once it's ready to serve (all data streamed + repaired)
Decommission the "old" (source) DC
For the gory details see here:
https://docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/add-dc-to-existing-dc.html
https://docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/decommissioning-data-center.html
If you prefer to go the full scan route. CQL reads on the source and CQL writes on the destination, with some ability for data manipulation and save points to resume from, than the Scylla Spark Migrator is a good option.
https://github.com/scylladb/scylla-code-samples/tree/master/spark-scylla-migrator-demo
You can also use the Scylla Spark migrator to migrate parquet files
https://www.scylladb.com/2020/06/10/migrate-parquet-files-with-the-scylla-migrator/
Remember not to migrate Materialized views (MV), you can always re-create them post migration again from the base tables.
We use an Apache Spark-based Migrator: https://github.com/scylladb/scylla-migrator
Here's the blog we wrote on how to do this back in 2019: https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-migrator/
Though in this case, you aren't moving from Cassandra to ScyllaDB; just moving from one ScyllaDB instance to another. If this makes sense to you, it should be straight forward. If you have questions, feel free to join our Slack community to get more interactive assistance:
http://slack.scylladb.com/
I am using a Cassandra 4-node cluster with full replication in all nodes.
I have defined a trigger on a table. However, when I update a row in this table, trigger is fired only on the local node.
Is there any way to fire this trigger in all nodes (based on replication)?
Triggers run on the coordinator before they are passed off on be applied. To see it on a per replica the best way is to use CDC (which is also more reliable than triggers) and follow the changes as they are flushed to commitlog.
With CDC you have to solve another problems:
validate order of the pockets, since it is not guaranteed
make tradeoff between single point of failure vs implementing tool for CDC logs duplication checker, let me explain:
You either enable CDC logging on one node and this will become your bottleneck. Or you enable CDC on all nodes and then you have to somehow manage data duplication since leader will send logs to repications.
You can deploy triggers on every node of your cluster. It won't cause any data duplication and works perfectly fine.
Assuming that we have 2 cassandra datacenters.
One of them is productive environment and well-secured, the other one is a test environment and easier to break, hence non-trusted.
We want data replication, but only propagated from the productive environment to test environment, not vice versa.
Is there any way to configure one data-center as a slave: not to receive replication data from the other one, and to revert the untrusted changes? It should be a read-only instance, which only receives data from the other datacenter.
In case somebody breaks the test environment, we do not want to productive environment to receive any manipulated data. Target would be that the test environment changes get reverted to the productive environment during replication.
No, it's not possible directly - in Cassandra changes made to the keyspace are propagated to all sides.
You can try different options by using separate clusters for prod and test:
Implement code to read CDC files, and apply to test cluster - this won't help with deleting the data from test environment, as this approach only apply changes.
Use DataStax advanced replication (that uses similar approach)
periodically replay the data from production to test using SSTableLoader - it will replay all data, so it will help with deletion of data on test. But it could take quite a long time if you have a lot of data.
Is there a possibility to write to a particular node using datastax driver?
For example, I have three nodes in datacenter 1 and three nodes in datacenter 2.
Existing
If i build up the cluster with any one of them as seed, all the nodes will get detected by the datastax java driver. So, in this case, if i insert a data using driver, it will automatically choose one of the nodes and proceed with it as the co-ordinator(preferably local data center)
Requirement
I want a way to contact any node in datacenter 2 and hand over the co-ordinator job to one of the nodes in datacenter 2.
Why i need this
I am trying to use the trigger functionality from datacenter 2 alone. Since triggers are taken care by co-ordinator , i want a co-ordinator to be selected from datacenter 2 so that data center 1 doesnt have to do this operation.
You may be able to use the DCAwareRoundRobinPolicy load balancing policy to achieve this by creating the policy such that DC2 is considered the "local" DC.
Cluster.Builder builder = Cluster.builder().withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("dc2"));
In the above example, remote (non-DC2) nodes will be ignored.
There is also a new WhiteListPolicy in driver version 2.0.2 that wraps another load balancing policy and restricts the nodes to a specific list you provide.
Cluster.Builder builder = Cluster.builder().withLoadBalancingPolicy(new WhiteListPolicy(new DCAwareRoundRobinPolicy("dc2"), whiteList));
For multi-DC scenarios Cassandra provides EACH and LOCAL consistency levels where EACH will acknowledge successful operation in each DC and LOCAL only in local one.
If I understood correctly, what you are trying to achieve is DC failover in your application. This is not a good practice. Let's assume your application is hosted in DC1 alongside with Cassandra. If DC1 goes down, your entire application is unavailable. If DC2 goes down, your application still can write with LOCAL CL and C* will replicate changes when DC2 is back.
If you want to achieve HA, you need to deploy application in each DC, use CL=LOCAL_X and finally do failover on DNS level (e.g. using AWS Route53).
See data consistency docs and this blog post for more info about consistency levels for multiple DCs.
I have a Cassandra cluster managed by Priam, with 3 nodes. I use ephemeral disks to store my Cassandra data, so when I start 1 node, the Cassandra data dir is empty.
I have Priam properly configured and I can see backups are saved in Amazon S3. Suppose a node goes down and then I start another node. Will Priam know how to automatic restore backup from S3 when the node comes up again? The Cassandra data dir will start empty, so I am assuming Priam would give the new node the same token as the old one and it would restore the data... Right?
Yes. I have been running standalone Cassandra on EC2, small Cassandra clusters on mesos on EC2, and larger DataStax Enterprise clusters (with Cassandra) on EC2.
I have been using the Priam 3.x branch.
On restore, it calculates the initial_token, updates the cassandra.yaml file, restores the snapshot and incremental backup files, and restarts Cassandra.
According to Priam/Netflix conventions, if you have a 3 node cluster with Cassandra, your nodes should be named some_thing-other-things. Each node should be a part of an Auto-scaling group called some_thing. Each node should also use a Security Group named some_thing.
Create a 3 node dev cluster and test your backups and restores with data that you can easily recreate, that you don't care about too much. Get used to managing the Auto-scaling groups and Priam. Then, try it on test clusters with data that you care about.