I understand that Cassandra does not have a master-slave relationship but is peer-to-peer, which is what I am after, and Cassandra's data replication concept is the other feature I am looking for.
However, is there a built-in feature in Cassandra that will perform a data model failover without any enhancement to the application? I am not after cluster failover but data model failover, and possibly rollback of the data model.
I would also like to know whether a database compare tool is available natively.
Sorry for my bad English.
Thanks.
Example:
Data Model A is table1...table10 (for example)
Data Model B is a replica of Data Model A
Scenario:
Control Center East - 3 nodes (consistency must be high for all tables table1...table10)
Control Center West - 2 nodes (eventual consistency for table1...table10)
Control Center North - 1 node (consistency must be high for some tables, table1...table5, and eventual consistency for the rest)
Sequence of events:
All nodes read/write to Data Model set A (I mean the client reads and writes to Data Model set A). Data Model A is the one in production.
Make changes to Data Model B (add new rows, modify column values, etc.).
Failover or changeover for all nodes from Data Model set A to Data Model set B.
During the failover, a small number of column values for identical rows between Data Model A and B have to be copied over from Data Model A to B.
After the failover, check whether the application gives wrong calculations (formula/output) based on the new Data Model B.
If Data Model B is problematic, fail back to Data Model A for all nodes. Check/fix Data Model B before running steps 2-3 again.
Hopefully this is not too long-winded.
Related
Consider a Cassandra instance deployed across two data centers for geo-redundancy.
Is it possible to configure this cluster with a consistency level such that we get both geo-redundancy (availability even if an entire data center goes down, with the cluster continuing to operate on the remaining data center) and fully consistent reads and writes? Does this ask violate the CAP theorem?
No, you can't (otherwise you'd violate CAP).
If you want fully consistent reads and writes across data centers, then you will have to give up availability or partition tolerance.
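To make the trade-off concrete, here is a minimal cqlsh sketch; the app.events keyspace/table and the UUID are hypothetical, not from the question.

    -- Fully consistent across data centers: every write must be acknowledged by a
    -- quorum in EACH data center, so writes start failing as soon as one whole DC
    -- is unreachable (consistency kept, availability given up).
    CONSISTENCY EACH_QUORUM;
    INSERT INTO app.events (id, payload) VALUES (uuid(), 'example');

    -- Available during a DC outage: the quorum is computed only in the
    -- coordinator's local data center, but a read in the surviving DC may miss
    -- writes that never replicated (availability kept, consistency given up).
    CONSISTENCY LOCAL_QUORUM;
    SELECT * FROM app.events WHERE id = 123e4567-e89b-12d3-a456-426614174000;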
I had a couple of questions regarding the Cassandra connector written by Data Mountaineer. Any help is greatly appreciated as we're trying to figure out the best way to scale our architecture.
Do we have to create a connector config for each Cassandra table we want to update? For instance, let's say I have 1000 tables. Each table is dedicated to a different type of widget. Each widget has similar characteristics, but slightly different data. Do we need to create a connector for each table? If so, how is this managed, and how does it scale?
In Cassandra, we often need to model column families based on the business need. We may have 3 tables representing user information: one by username, one by email, and one by last name. Would we need 3 connector configs and deploy 3 separate sink tasks to push data to each table?
I think both questions are similar: can the sink handle multiple topics?
The sink can handle multiple tables in one sink, so one configuration. This is set in the KCQL statement connect.cassandra.export.route.query=INSERT INTO orders SELECT * FROM orders-topic;INSERT INTO positions SELECT * FROM positions, but at present the target tables need to be in the same Cassandra keyspace. This would route events from the orders-topic topic to a Cassandra table called orders and events from the positions topic to a table called positions. You can also select specific columns and rename them, like SELECT columnA AS columnB.
You may want more than one sink instance for separation of concerns, i.e. isolating the writes of one group of topics from other, unrelated topics.
You can scale with the number of tasks the connector is allowed to run; each task starts a writer for all the target tables.
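For illustration, a single connector configuration covering both tables might look roughly like the sketch below. Only connect.cassandra.export.route.query is taken from the answer above; the connector class and the contact-point/keyspace property names are assumptions from memory and may differ between connector versions, so treat this as a shape rather than a copy-paste config.

    # Hypothetical sink config: one connector, two topics, two tables, scaled to 3 tasks.
    name=cassandra-sink-orders-positions
    connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
    tasks.max=3
    topics=orders-topic,positions
    # Route each topic to its own table in the same keyspace (KCQL from above).
    connect.cassandra.export.route.query=INSERT INTO orders SELECT * FROM orders-topic;INSERT INTO positions SELECT * FROM positions
    # Connection settings; exact property names may vary by version.
    connect.cassandra.contact.points=localhost
    connect.cassandra.port=9042
    connect.cassandra.key.space=trading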
We have a support channel of our own for more direct communication. https://datamountaineer.com/contact/
How do you configure Cassandra so that some tables are NOT replicated at all but others are? Is this actually a good use case for Cassandra?
I have a group of customers (max. 50) that will all supply data on a daily basis (~50,000 records per customer per day, ~200 fields per record). I need to pre-process the data to obfuscate sensitive information locally, then combine the data centrally for analysis, and then allow reporting against the combined data set. I am planning on each customer having a local Cassandra node for the raw data load (several flat files), but I don't want this replicated until the obfuscation is complete. Can I do this with different keyspaces and replication factors? The data can be keyed using customer ID as a PK, if that helps.
You could have a keyspace for the customer raw data with a replication factor of 1 and keep the raw-data tables in there, and then have the obfuscated-data tables in a separate keyspace with a replication factor > 1.
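As a rough sketch (the keyspace and data center names here are made up, not from the question), per-keyspace replication settings can keep the raw data local while letting the obfuscated data replicate to the central cluster:

    -- Raw data stays only in the customer's local data center (single copy).
    CREATE KEYSPACE raw_customer_data
      WITH replication = {'class': 'NetworkTopologyStrategy', 'customer_dc': 1};

    -- Obfuscated data is also replicated to the central analysis data center.
    CREATE KEYSPACE obfuscated_data
      WITH replication = {'class': 'NetworkTopologyStrategy', 'customer_dc': 1, 'central_dc': 2};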
So far, I've been through data partitioning in Cassandra and found some basic ways of doing things. For example, if you have 6 nodes, with 3 in each of two separate data centers, we have the following method of data replication:
Data replication occurs by walking through the nodes until Cassandra comes across a node in the ring belonging to another data center, and it places the replica there, repeating the process until all data centers have one copy of the data, as per NetworkTopologyStrategy.
So we have two copies of the entire data set, with one in each data center. But what if I wanted to logically split the data into two separate chunks, based on some attribute like business or geographic location (data for India in the India data center)? We would then have one chunk of data in the data centers of one geographic location, another chunk in another location, and none of them overlapping.
Would that be possible?
And given the application of Cassandra and Big Data in general, would that make sense?
Geographic sharding is certainly possible. You can simply run multiple data centers that aren't connected, and then they won't replicate to each other. Alternatively, you can have them replicate, but have your India-based app only read and write to your India DC. Whether it makes sense depends on your application.
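One common way to express the non-overlapping variant within a single cluster is per-region keyspaces whose replication maps only name their own data center; a sketch with made-up keyspace and DC names:

    -- India data lives only on nodes in the India data center...
    CREATE KEYSPACE india_data
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC_India': 3};

    -- ...and US data only on nodes in the US data center, so the two chunks
    -- never overlap even though the nodes form one cluster.
    CREATE KEYSPACE us_data
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC_US': 3};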
I am new to Cassandra and I would like to learn more about Cassandra's racks and structure.
Suppose I have around 70 column families in Cassandra and two AWS EC2 instances.
How many Data Centres will be used?
How many nodes will each rack have?
Is it possible to divide a column family in multiple keyspaces?
The intent of making Cassandra aware of logical racks and data centers is to provide additional levels of fault tolerance. The idea (as described in this document, under the "Network Topology Strategy") is that the application should still be able to function if one rack or data center goes dark. Essentially, Cassandra...
places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.
In this way, you can also query your data at LOCAL_QUORUM, in which the quorum ((replication_factor / 2) + 1) is computed only from the nodes present in the same data center as the coordinator node. This reduces the effects of inter-data center latency.
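As a quick worked example (the keyspace/table names are made up): with a replication factor of 3 in the local data center, LOCAL_QUORUM = (3 / 2) + 1 = 2, so the coordinator waits for only 2 replicas, both in its own data center.

    -- Set the consistency level in cqlsh, then read; only 2 of the 3 local
    -- replicas need to respond, and remote data centers are not consulted.
    CONSISTENCY LOCAL_QUORUM;
    SELECT * FROM users_ks.users_by_name WHERE username = 'alice';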
As for your questions:
How many data centers are used is entirely up to you. If you only have two AWS instances, putting them in different logical data centers is possible, but it only makes sense if you are planning to use consistency level ONE. As in: if one instance goes down, your application only needs to worry about finding one other replica. But even then, the snitch can only find the data on one instance or the other.
Again, you can define the number of nodes that you wish to have for each rack. But as I indicated with #1, if you only have two instances, there isn't much to be gained by splitting them into different data centers or racks.
I do not believe it is possible to divide a column family over multiple keyspaces. But I think I know what you're getting at. Each keyspace will be created on each instance. As you have 2 instances, you will be able to specify a replication factor of 1 or 2. If you had 3 instances, you could set a replication factor of 2, and then if you lost 1 instance you would still have access to all the data. As you only have 2 instances, you need to be able to handle one going dark, so you will want to make sure both instances have a copy of every row (replication factor of 2).
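To make that last point concrete, here is a minimal sketch (the keyspace name is made up) of a two-instance, single data center keyspace where both nodes hold every row:

    -- Replication factor 2 on a 2-node cluster: each node stores a full copy,
    -- so the data stays readable if one instance goes dark.
    CREATE KEYSPACE app_data
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};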
Really, the logical data center/rack structure becomes more useful as the number of nodes in your cluster increases. With only two, there is little to be gained by splitting them with additional logical barriers. For more information, read through the two docs I linked above:
Apache Cassandra 2.0: Data Replication
Apache Cassandra 2.0: Snitches