So far, I've been reading about data partitioning in Cassandra and the basic ways of doing things. For example, if you have 6 nodes, with 3 in each of two separate data centers, data replication works as follows:
With NetworkTopologyStrategy, Cassandra walks the ring until it comes across a node belonging to another data center and places the replica there, repeating the process until every data center holds one copy of the data.
So, we have two copies of the entire data set, one in each data center. But what if I wanted to logically split the data into two separate chunks based on some attribute such as business unit or geographic location (data for India in the India data center)? We would then have one chunk of data in the data centers of one region, another chunk in another region, and no overlap between them.
Would that be possible?
And given how Cassandra and Big Data systems are typically used, would that make sense?
Geographic sharding is certainly possible. You can run separate clusters (data centers that aren't connected), so they never replicate to each other. Alternatively, you can let them replicate but have your India-based app read and write only to your India DC. Whether it makes sense depends on your application.
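One common way to keep the chunks non-overlapping while still running a single cluster is to give each region its own keyspace and list only that region's data center in its replication settings. Here is a minimal sketch with the DataStax Java driver 3.x; the contact point and the data center names 'IN' and 'US' are illustrative and must match whatever your snitch reports:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class GeoKeyspaces {
    public static void main(String[] args) {
        // Contact point is illustrative; any reachable node will do.
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        // Data for India: replicated only inside the 'IN' data center.
        session.execute(
            "CREATE KEYSPACE IF NOT EXISTS india_data "
            + "WITH replication = {'class': 'NetworkTopologyStrategy', 'IN': 3}");

        // Data for the US: replicated only inside the 'US' data center.
        session.execute(
            "CREATE KEYSPACE IF NOT EXISTS us_data "
            + "WITH replication = {'class': 'NetworkTopologyStrategy', 'US': 3}");

        cluster.close();
    }
}
```

Tables created in india_data then live only on India DC nodes, and tables in us_data only on US DC nodes, so the two chunks never overlap.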
Consider a growing amount of data, and let's choose between two extreme options:
1. Evenly distribute all data across all nodes in the cluster.
2. Pack the data onto as few nodes as possible.
I prefer option 1 because, as the volume of data grows, we can spread it across all nodes, so that each node carries the lowest possible load when queried.
However, some resources state that we shouldn't query all the nodes because that will slow down the query. Why would that slow the query? Isn't that just a normal scatter-gather? They even claim this hurts linear scalability, as adding more nodes will drag the query down further.
(Maybe I am missing something about how Cassandra performs queries; background references are appreciated.)
Conversely, other resources state that we should go with option 2 because it queries the fewest nodes.
Of course, there are no black-and-white choices here; everything has a tradeoff.
I want to know what the real difference is between option 1 and option 2 and, regarding querying over the network, why option 1 would be slow.
I prefer option 1 because, as the volume of data grows, we can spread it across all nodes, so that each node carries the lowest possible load when queried.
You definitely want to go with option 1. It is also preferable in that new or replacement nodes will stream data much faster than they would in a cluster made of fewer, denser nodes.
However, some resources state that we shouldn't query all the nodes because that will slow down the query.
And those resources are absolutely correct. First of all, if you read through the resources which Alex posted above, you'll discover how to build your tables so that your queries can be served by a single node. Running queries which hit only a single node is the best way around that problem.
Why would that slow the query?
Because in a distributed database environment, query time becomes network time. There are many people out there who like to run multi-key or unbound queries against Cassandra. When that happens and the data cannot be served from a single node, Cassandra designates one node as the "coordinator" for the query.
That node builds the result set with data pulled from the other nodes, which means that in a 30-node cluster, one node is now pulling data from the other 29. Even if those requests don't time out, the coordinator is very likely to crash from trying to manage too much data.
The bottom line is that this is one of those tradeoffs between a CA relational database and an AP partitioned row store. Build your tables to support your queries, store together the data that is queried together, and Cassandra will perform just fine.
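As a rough sketch of what "store together the data that is queried together" looks like in practice (the keyspace, table, and column names are made up, and this assumes the DataStax Java driver 3.x against a cluster with a data center named 'DC1'):

```java
import java.util.UUID;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SinglePartitionQueries {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS app_ks WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'DC1': 3}");

        // Data queried together is stored together: one sensor's readings
        // share a single partition, clustered by time.
        session.execute("CREATE TABLE IF NOT EXISTS app_ks.readings_by_sensor ("
                + " sensor_id uuid, reading_time timestamp, value double,"
                + " PRIMARY KEY ((sensor_id), reading_time))"
                + " WITH CLUSTERING ORDER BY (reading_time DESC)");

        // Single-partition read: the coordinator only contacts the replicas
        // that own this sensor_id, not every node in the cluster.
        ResultSet rs = session.execute(
                "SELECT reading_time, value FROM app_ks.readings_by_sensor"
                + " WHERE sensor_id = ? LIMIT 100",
                UUID.randomUUID());
        for (Row row : rs) {
            System.out.println(row.getTimestamp("reading_time") + " " + row.getDouble("value"));
        }

        cluster.close();
    }
}
```

Because sensor_id is the full partition key, the SELECT touches exactly one partition, so only that partition's replicas do any work.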
How does Cassandra handle a large volume of reads for a single key? Think of a very popular celebrity whose Twitter page is hit constantly.
You will usually have multiple replicas of each shard. Let's say your replica count is 3. Then reads for a single key can be spread over the nodes hosting those replicas. But that's the limit of the parallelism: adding more nodes to your cluster would not increase the number of replicas, so the traffic would still have to go to just those 3 nodes. There are various tricks people use for such cases, e.g. caching in the web server so it doesn't have to keep going back to the database, or denormalizing the data so it is spread over more nodes.
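A rough sketch of the "spread it over more nodes" trick in Cassandra terms (everything here is illustrative: DataStax Java driver 3.x, a data center named 'DC1', and a fixed bucket count of 8): adding a synthetic bucket to the partition key splits one hot key across several partitions, each with its own replica set.

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class HotKeyBuckets {
    // Fixed number of buckets the hot key is spread over (illustrative).
    private static final int BUCKETS = 8;

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS app_ks WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'DC1': 3}");

        // The bucket is part of the partition key, so one celebrity's timeline
        // is split across BUCKETS partitions, each with its own replica set.
        session.execute("CREATE TABLE IF NOT EXISTS app_ks.timeline ("
                + " celebrity_id uuid, bucket int, posted_at timeuuid, body text,"
                + " PRIMARY KEY ((celebrity_id, bucket), posted_at))");

        UUID celebrity = UUID.randomUUID();

        // Writes pick a random bucket, spreading the load over more replica sets.
        int bucket = ThreadLocalRandom.current().nextInt(BUCKETS);
        session.execute("INSERT INTO app_ks.timeline (celebrity_id, bucket, posted_at, body)"
                + " VALUES (?, ?, now(), ?)", celebrity, bucket, "hello world");

        // Reads fan out over the known bucket range and are merged client-side;
        // bounded extra work, but no single partition takes all the traffic.
        for (int b = 0; b < BUCKETS; b++) {
            session.execute("SELECT body FROM app_ks.timeline"
                    + " WHERE celebrity_id = ? AND bucket = ? LIMIT 20", celebrity, b);
        }

        cluster.close();
    }
}
```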
Assuming 4 nodes split across 2 data centers (DC1-1, DC1-2, DC2-1, DC2-2).
Using partition groups and the default backup count of 1, the documentation and other questions/articles are pretty clear about how data is distributed (assuming well-distributed data): 25% per node as primary, with all the primary data on DC1-1/DC1-2 backed up on either DC2-1 or DC2-2, and vice versa.
It is not clear what the expected behavior is in the same situation if we increase the backup count to 2. Assume entry #1 is currently primary on DC1-1. Would the two backup copies both be forced onto the two DC2 nodes? Is there a way to ensure there is one backup in each partition group (i.e. primary on DC1-1, one backup on DC1-2, and one backup on either DC2-1 or DC2-2)?
Thanks
First of all, we do not recommend splitting a single cluster over multiple data centers. There are possible exceptions, but keep in mind that latency between data centers matters when you partition the data.
To your question:
If you have just two partition groups defined, there is no way to create more than one backup. Think of a normal cluster as having one partition group per node, so with pG partition groups you can have at most pG - 1 backups. If you change the configuration to 2 partition groups, that means you can have only one backup.
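To make the scenario concrete, here is a hedged sketch of the member-side Hazelcast configuration being discussed: two CUSTOM member groups (one per data center, with illustrative interface patterns) and a map backup count of 2. As explained above, with only two groups the second backup has no distinct group to land in, so effectively only one backup is kept.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MemberGroupConfig;
import com.hazelcast.config.PartitionGroupConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class TwoGroupBackupConfig {
    public static void main(String[] args) {
        Config config = new Config();

        // Two CUSTOM member groups, one per data center. The interface
        // patterns are illustrative placeholders for the DC1 and DC2 ranges.
        PartitionGroupConfig pg = config.getPartitionGroupConfig();
        pg.setEnabled(true);
        pg.setGroupType(PartitionGroupConfig.MemberGroupType.CUSTOM);

        MemberGroupConfig dc1 = new MemberGroupConfig();
        dc1.addInterface("10.10.1.*"); // DC1-1, DC1-2
        MemberGroupConfig dc2 = new MemberGroupConfig();
        dc2.addInterface("10.10.2.*"); // DC2-1, DC2-2
        pg.addMemberGroupConfig(dc1);
        pg.addMemberGroupConfig(dc2);

        // Ask for two backups of each map entry. With only two member groups,
        // there is no third group for the second backup, so effectively only
        // one backup is kept, as noted above.
        MapConfig mapConfig = new MapConfig("default");
        mapConfig.setBackupCount(2);
        config.addMapConfig(mapConfig);

        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
    }
}
```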
I am new to Cassandra and I would like to learn more about Cassandra's racks and structure.
Suppose I have around 70 column families in Cassandra and two AWS instances.
How many data centers will be used?
How many nodes will each rack have?
Is it possible to divide a column family across multiple keyspaces?
The intent of making Cassandra aware of logical racks and data centers is to provide additional levels of fault tolerance. The idea (as described in this document, under the "Network Topology Strategy") is that the application should still be able to function if one rack or data center goes dark. Essentially, Cassandra...
places replicas in the same data center by walking the ring clockwise
until reaching the first node in another rack. NetworkTopologyStrategy
attempts to place replicas on distinct racks because nodes in the same
rack (or similar physical grouping) often fail at the same time due to
power, cooling, or network issues.
In this way, you can also query your data at LOCAL_QUORUM, where the quorum ((replication_factor / 2) + 1) is computed only from the replicas in the same data center as the coordinator node. This reduces the effects of inter-data-center latency.
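To make that concrete, here is a rough sketch with the DataStax Java driver 3.x; the contact point, keyspace, table, and the data center name 'DC1' are all illustrative:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class LocalQuorumRead {
    public static void main(String[] args) {
        // Pin the driver to one data center so its nodes coordinate the
        // queries ("DC1" and the contact point are illustrative).
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")
                .withLoadBalancingPolicy(
                        DCAwareRoundRobinPolicy.builder().withLocalDc("DC1").build())
                .build();
        // Assumes a keyspace app_ks with a users table already exists.
        Session session = cluster.connect("app_ks");

        // LOCAL_QUORUM: with replication_factor 3 in DC1, (3 / 2) + 1 = 2
        // local replicas must respond; replicas in other data centers are
        // not waited on, avoiding cross-DC latency on the read path.
        SimpleStatement stmt = new SimpleStatement(
                "SELECT * FROM users WHERE user_id = 42");
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        ResultSet rs = session.execute(stmt);

        cluster.close();
    }
}
```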
As for your questions:
How many data centers you use is entirely up to you. If you only have two AWS instances, putting them in different logical data centers is possible, but it only makes sense if you plan to use consistency level ONE. That is, if one instance goes down, your application only needs to find one other replica. But even then, the snitch can only find the data on one instance or the other.
Again, you can define the number of nodes that you wish to have for each rack. But as I indicated with #1, if you only have two instances, there isn't much to be gained by splitting them into different data centers or racks.
I do not believe it is possible to divide a column family over multiple keyspaces. But I think I know what you're getting at. Each keyspace will be created on each instance. As you have 2 instances, you will be able to specify a replication factor of 1 or 2. If you had 3 instances, you could set a replication factor of 2, and then if you lost 1 instance you would still have access to all the data. As you only have 2 instances, you need to be able to handle one going dark, so you will want to make sure both instances have a copy of every row (replication factor of 2).
Really, the logical data center/rack structure becomes more useful as the number of nodes in your cluster increases. With only two, there is little to be gained by splitting them with additional logical barriers. For more information, read through these two docs:
Apache Cassandra 2.0: Data Replication
Apache Cassandra 2.0: Snitches
I'm designing a Riak cluster at the moment and wondering whether it is possible to hint to Riak that a specific bunch of keys should be placed on a single node of the cluster.
For example, there is some private data for a user that only she is able to access. This data consists of ~10k documents (too large to be kept in one key/document), and to serve one page we need to retrieve ~100 of them. It would be better to keep the whole bunch on a single node, with the application on the same instance, to make this faster.
AFAIK this is easy in Cassandra: just use OrderedPartitioner and keys like this: <hash(username)>/<private data key>. That way, almost all of a user's keys will be kept on a single node.
One of the points of using Riak is that your data is replicated and evenly distributed throughout the cluster, thus improving your tolerance for network partitions and outages. Placing data on specific nodes goes against that goal and increases your vulnerability.
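For comparison with the Cassandra approach mentioned in the question: on current Cassandra versions the same effect (one user's documents living together on the same replica set) comes from a compound primary key rather than the OrderedPartitioner. A minimal, illustrative sketch with the DataStax Java driver 3.x (keyspace, table, and DC names are made up):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class UserDocuments {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS app_ks WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'DC1': 3}");

        // username is the partition key, so all of a user's documents form one
        // partition and are stored on the same replica nodes -- no ordered
        // partitioner required.
        session.execute("CREATE TABLE IF NOT EXISTS app_ks.private_docs ("
                + " username text, doc_id uuid, body text,"
                + " PRIMARY KEY ((username), doc_id))");

        // Fetching ~100 of the user's documents is a single-partition read.
        ResultSet page = session.execute(
                "SELECT doc_id, body FROM app_ks.private_docs WHERE username = ? LIMIT 100",
                "alice");

        cluster.close();
    }
}
```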