Imagine that an organization has two data centers (named 'A' and 'B' for simplicity) with multiple nodes running in each, and there is a Hazelcast cluster over all of these nodes. Assume that there is a Distributed Map in this cluster, which is configured to have backup-count of 1.
Is there a way to configure the Hazelcast Distributed Map so that nodes in Data Center A are backed up on the nodes in Data Center B and vice versa? This would mean that in the event of losing a single data center, the Map data (including its backup) is not lost.
What you want is called Partition Grouping; see the documentation for details.
The simplest thing you can do is to include this snippet in your Hazelcast configuration:
<partition-group enabled="true" group-type="CUSTOM">
    <member-group>
        <interface>10.10.1.*</interface> <!-- network in data centre A -->
    </member-group>
    <member-group>
        <interface>10.10.2.*</interface> <!-- network in data centre B -->
    </member-group>
</partition-group>
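The same grouping can be set up programmatically; a minimal Java sketch, assuming the same interface ranges as in the XML above:

import com.hazelcast.config.Config;
import com.hazelcast.config.MemberGroupConfig;
import com.hazelcast.config.PartitionGroupConfig;
import com.hazelcast.config.PartitionGroupConfig.MemberGroupType;
import com.hazelcast.core.Hazelcast;

public class PartitionGroupExample {
    public static void main(String[] args) {
        Config config = new Config();

        PartitionGroupConfig pg = config.getPartitionGroupConfig();
        pg.setEnabled(true).setGroupType(MemberGroupType.CUSTOM);

        // One member group per data centre; Hazelcast never places a partition
        // and its backup inside the same member group.
        pg.addMemberGroupConfig(new MemberGroupConfig().addInterface("10.10.1.*")); // DC A
        pg.addMemberGroupConfig(new MemberGroupConfig().addInterface("10.10.2.*")); // DC B

        Hazelcast.newHazelcastInstance(config);
    }
}

With two member groups and a backup-count of 1, the backup of every partition owned in data centre A lands on a member in data centre B, and vice versa.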
Another option is to create a separate cluster in each data centre and connect them via WAN replication. This will decrease latencies within each data centre, but it can produce conflicting updates, and it's then up to your configured MergePolicy to resolve them.
Is there any cloud storage system (e.g., Cassandra, Hazelcast, OpenStack Swift) where we can change the replication factor of selected objects? For instance, say we have identified hotspot objects in the system; could we increase their replication factor as a solution?
In Cassandra the replication factor is controlled based on keyspaces. So you first define a keyspace by specifying the replication factor the keyspace should have in each of your data centers. Then within a keyspace, you create database tables, and those tables are replicated according to the keyspace they are defined in. Objects are then stored in rows in a table using a primary key.
You can change the replication factor for a keyspace at any time by using the "alter keyspace" CQL command. To update the cluster to use the new replication factor, you would then run "nodetool repair" for each node (most installations run this periodically anyway for anti-entropy).
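For illustration, this is how that might look from the DataStax Java driver; the keyspace name and replication settings here are made up:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class AlterReplication {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Raise the replication factor of an existing keyspace.
        session.execute("ALTER KEYSPACE mykeyspace WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}");

        cluster.close();
        // Afterwards, on each node, run: nodetool repair mykeyspace
    }
}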
Then if you use, for example, the Cassandra Java driver, you can specify the load balancing policy to use when accessing the cluster, such as round robin or the token-aware policy. So if you have multiple replicas of the table holding the objects, the load of accessing an object can be spread round-robin over just the nodes that have a copy of the row you are accessing. If you are using a read consistency level of ONE, this spreads out the read load.
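A sketch of such a driver setup (the contact point is illustrative):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class BalancedReads {
    public static void main(String[] args) {
        // TokenAwarePolicy routes each statement to the nodes holding a replica
        // of its partition key; the wrapped RoundRobinPolicy rotates among them.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.ONE)) // read from one replica
                .build();
        // ... cluster.connect() and query as usual ...
        cluster.close();
    }
}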
So the granularity of this is not at the object level, but at the table level. If you had all your objects stored in one table, then changing the replication factor would change it for all objects in that table and not just one. You could have multiple keyspaces with different replication factors and keep high demand objects in a keyspace with a high RF, and less frequently accessed objects in a keyspace with a low RF.
Another way you could reduce the hot spot for an object in Cassandra is to make additional copies of it by inserting it into additional rows of a table. Rows are located on nodes via the compound partition key, so one field of the partition key could be a "copy_number" value; when you go to read the object, you pick a random copy_number (from 0 up to the number of copy rows, exclusive) so that each read will likely hit a different node, since rows are hashed across the cluster based on the partition key. This approach gives you granularity at the object level, compared to changing the replication factor for the whole table, at the cost of more programming work to manage writing and randomly reading the extra rows.
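A sketch of that scheme, with made-up table and column names:

import java.util.concurrent.ThreadLocalRandom;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CopyRowReads {
    static final int NUM_COPIES = 4; // number of duplicate rows per object

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace"); // assumes this keyspace exists

        // The compound partition key (object_id, copy_number) hashes each copy
        // to a (likely) different set of replica nodes.
        session.execute("CREATE TABLE IF NOT EXISTS objects (object_id text, "
                + "copy_number int, payload blob, PRIMARY KEY ((object_id, copy_number)))");

        // Write every copy of the hot object...
        for (int i = 0; i < NUM_COPIES; i++) {
            session.execute("INSERT INTO objects (object_id, copy_number, payload) "
                    + "VALUES ('hot-object', " + i + ", textAsBlob('data'))");
        }

        // ...and read a random copy to spread the load across nodes.
        int copy = ThreadLocalRandom.current().nextInt(NUM_COPIES);
        Row row = session.execute("SELECT payload FROM objects WHERE "
                + "object_id = 'hot-object' AND copy_number = " + copy).one();

        cluster.close();
    }
}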
In Infinispan, you can also set the number of owners (replicas) on each cache (the equivalent of Hazelcast's map or Cassandra's table), but not for one specific entry. Since the routing information (aka the consistent hash table) does not contain all keys, but rather splits the 32-bit hashCode() range into a number of segments and then specifies the distribution only for those segments, there's no way to specify the number of replicas per entry.
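For reference, a minimal sketch of that per-cache setting with Infinispan's programmatic configuration (the cache name is made up):

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class OwnersExample {
    public static void main(String[] args) {
        GlobalConfiguration global = GlobalConfigurationBuilder.defaultClusteredBuilder().build();

        // Distributed cache where every entry is stored on 2 nodes (1 owner + 1 backup).
        Configuration cfg = new ConfigurationBuilder()
                .clustering().cacheMode(CacheMode.DIST_SYNC)
                .hash().numOwners(2)
                .build();

        DefaultCacheManager manager = new DefaultCacheManager(global);
        manager.defineConfiguration("myCache", cfg);
        manager.getCache("myCache").put("key", "value");
        manager.stop();
    }
}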
Theoretically, with specially forged keys and a custom consistent hash factory, you could achieve per-entry replica counts even within one cache (certain sorts of keys would be replicated a different number of times), but that would require coding with a deep understanding of the system.
Either way, the reader would have to know the number of replicas in advance, as it would be part of the routing information (the cache in the simple case, the special keys as described above), so it's not really practical.
I guess you want to use the replication factor for the sake of speeding up reads.
The regular Map (IMap) implementation uses a master/backup setup, so by default all reads go through the partition owner (the master). But there is a special setting, read-backup-data, that also allows reads from backups. So if you have a 10-node cluster and a backup count of 5, there will be 6 members in total that have the information stored: the 5 members holding a backup read it locally, the owner reads its own copy, and the remaining 4 members make a remote call to the owner.
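A minimal sketch of that configuration, using a hypothetical map name:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ReadBackupExample {
    public static void main(String[] args) {
        Config config = new Config();
        config.getMapConfig("hotMap")
              .setBackupCount(5)        // 1 owner + 5 backups = 6 members hold each entry
              .setReadBackupData(true); // members holding a backup read it locally

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        hz.getMap("hotMap").put("key", "value");
    }
}

Note that reading from backups can return slightly stale values if a backup lags behind the owner.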
There is also a fully replicated map available, where every entry is sent to every machine. So in a 10-node cluster, all reads will be local since every machine has the same data.
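A minimal sketch using the replicated map:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ReplicatedMap;

public class ReplicatedMapExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Every entry is replicated to every member, so reads are always local.
        ReplicatedMap<String, String> map = hz.getReplicatedMap("settings");
        map.put("key", "value");
        System.out.println(map.get("key"));
    }
}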
In the case of the IMap, we don't provide control over the number of backups at the key/value level; the whole map is configured with a single backup-count.
So far I've been through data partitioning in Cassandra and found some basic ways of doing things. For example, if you have 6 nodes, 3 each in two separate data centers, data is replicated as follows:
Replication proceeds by walking through the nodes until Cassandra comes across a node in the ring belonging to another data center and places the replica there, repeating the process until all data centers have one copy of the data, as per NetworkTopologyStrategy.
So, we have two copies of the entire data set, one in each data center. But what if I wanted to logically split the data into two separate chunks, based on some attribute like business unit or geographic location (data for India in the India data center)? Then we would have one chunk of data in the data centers of one geographic location, another chunk in another location, and none of them overlapping.
Would that be possible?
And given the application of Cassandra and Big Data in general, would that make sense?
Geographic sharding is certainly possible. If you simply run multiple data centers that aren't connected, they won't replicate to each other. Alternatively, you can have them replicate, but have your India-based app read and write only to your India DC. Whether it makes sense depends on your application.
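One way to realize that kind of split inside a single cluster is a keyspace per region; a hedged sketch with made-up keyspace and data center names (a data center not listed in the replication map gets no copies, so the chunks never overlap):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class GeoShardedKeyspaces {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // India's data lives only in the India data center...
        session.execute("CREATE KEYSPACE IF NOT EXISTS india_data WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'IndiaDC': '3'}");
        // ...and Europe's data only in the Europe data center.
        session.execute("CREATE KEYSPACE IF NOT EXISTS europe_data WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'EuropeDC': '3'}");

        cluster.close();
    }
}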
I've been looking at Datastax's Architecture in brief web page (and a few others) but I found it didn't really answer key questions I had. So I went ahead and wrote up an edited copy of the Datastax web page (see http://benslade.com/wordpress/?p=152, all feedback welcome).
I know I can figure things out by actually setting up a Cassandra database, but I don't like having to figure out "what it does" for the user by reverse-engineering "how it's implemented" by the developer.
So, I have a few more questions about how things work in Cassandra at an architecture level:
The overview says, "data is distributed among all nodes in the cluster. Each node exchanges information across the cluster every second". And later says a cluster is, "All writes are automatically partitioned and replicated throughout the cluster". What is the relationship between a cluster and a data center? Ie. is a data center a part of an overall cluster. Do all nodes in all data centers exchange info with each other every second? Does a write to any node in a particular data center get propagated to other data centers the same as it gets propagated in the current data center?
The overview says "Once the memory structure (memtable) is full, the data is written to disk in an SSTable data file". Can the same data been in the memtable and the SSTable at the same time. Ie. is the memtable a datacache for the SSTable?
In the future, please try to limit your posts to one question at a time.
What is the relationship between a cluster and a data center?
A cluster can contain one or more logical data centers. Cassandra is data center-aware, which means you can alter your replication strategy on a per-data center basis. Also, Cassandra has the concept of "locality," which means that the snitch can restrict a request to nodes in a particular data center.
Ex: Querying with LOCAL_QUORUM will read only from nodes in the data center that is determined to be the "closest" (network-wise), whereas querying with QUORUM will read from (N/2 + 1) nodes regardless of data center (where N = the replication factor).
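For illustration, the consistency level can be set per statement with the DataStax Java driver (the table name is made up; the stackoverflow keyspace matches the example further down):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyLevels {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("stackoverflow");

        // LOCAL_QUORUM: a quorum of replicas in the local data center only.
        session.execute(new SimpleStatement("SELECT * FROM mytable WHERE id = 1")
                .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM));

        // QUORUM: a quorum of all replicas, regardless of data center.
        session.execute(new SimpleStatement("SELECT * FROM mytable WHERE id = 1")
                .setConsistencyLevel(ConsistencyLevel.QUORUM));

        cluster.close();
    }
}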
Do all nodes in all data centers exchange info with each other every second?
Again, the snitch handles the distribution of replicas and ensures that all nodes are kept current with the configured replication factor. Of course as Cassandra embraces the Highly-Available, Partition Tolerant side of the CAP Theorem, all replicas operate on the concept of "Eventual Consistency." Meaning, they will all get updated, but it may or may not happen before that data is requested.
Does a write to any node in a particular data center get propagated to other data centers the same as it gets propagated in the current data center?
Yes, but again it depends on the configured replication factor. Consider the following keyspace definition:
CREATE KEYSPACE stackoverflow WITH replication = {
'class': 'NetworkTopologyStrategy',
'WestCoastDC': '2',
'EastCoastDC': '3'
};
With this configuration, the snitch will ensure that a write to a replica in any data center will be propagated to my "WestCoastDC" until it has two copies of the data. Likewise, my "EastCoastDC" will have three copies of the same data. Note, your replication factor must be equal to or less than the number of nodes in that data center.
Can the same data be in the memtable and the SSTable at the same time? I.e., is the memtable a data cache for the SSTable?
I don't believe this can happen. All writes in Cassandra are written to the in-memory memtable and simultaneously persisted on disk via the commit log. Once your memtable threshold is reached, the memtable contents are flushed and persisted to SSTables. And if your node experiences a plug-out-of-the-wall event, the commit log is verified and reconciled on restart to ensure that its contents exist in the SSTables.
I am new to Cassandra and I would like to learn more about Cassandra's racks and structure.
Suppose I have around 70 column families in Cassandra and two AWS EC2 instances.
How many Data Centres will be used?
How many nodes will each rack have?
Is it possible to divide a column family across multiple keyspaces?
The intent of making Cassandra aware of logical racks and data centers is to provide additional levels of fault tolerance. The idea (as described in this document, under the "Network Topology Strategy") is that the application should still be able to function if one rack or data center goes dark. Essentially, Cassandra...
places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.
In this way, you can also query your data by LOCAL_QUORUM, in which QUORUM ((replication_factor / 2) + 1) is only computed from the nodes present in the same data center as the coordinator node. This reduces the effects of inter-data center latency.
As for your questions:
How many data centers are used is entirely up to you. If you only have two AWS instances, putting them in different logical data centers is possible, but it only makes sense if you are planning to use consistency level ONE. As in: if one instance goes down, your application only needs to worry about finding one other replica. But even then, the data can only be found on one instance or the other.
Again, you can define the number of nodes that you wish to have for each rack. But as I indicated in #1, if you only have two instances, there isn't much to be gained by splitting them into different data centers or racks.
I do not believe it is possible to divide a column family over multiple keyspaces. But I think I know what you're getting at. Each keyspace will be created on each instance. As you have 2 instances, you will be able to specify a replication factor of 1 or 2. If you had 3 instances, you could set a replication factor of 2, and then if you lost 1 instance you would still have access to all the data. As you only have 2 instances, you need to be able to handle one going dark, so you will want to make sure both instances have a copy of every row (replication factor of 2).
Really, the logical data center/rack structure becomes more useful as the number of nodes in your cluster increases. With only two, there is little to be gained by splitting them with additional logical barriers. For more information, read through the two docs I linked above:
Apache Cassandra 2.0: Data Replication
Apache Cassandra 2.0: Snitches
I have a four-node, two-data-center Cassandra 1.1.1 cluster.
My keyspace is RF 2 per data center, giving me a complete copy of the data on each node.
The cluster is for a vendor product, which uses read/write consistency of QUORUM. With this config I can only handle the loss of one node. How can I tweak it to handle the loss of a data center?
Unless your data centers are in the same physical location, your network overhead is going to be terrible with this configuration. The reason is that quorum consistency pays no attention to data center when comparing replicas, so you will frequently have to cross data center lines before acking a read or write. Switching to local quorum would solve the latency issue, but (with RF 2 and two nodes per DC) it would effectively take a data center out of service if one of its nodes goes down. However, as long as both nodes in the second DC are up (and your app can handle the failover properly), you will still be up and running.
Having said that, the general rule of thumb is that 3 nodes is the bare minimum per data center. If you add a node to each data center and switch to local quorum R/W, you can lose one node in each DC and still have that DC operational, or you can lose an entire DC with the other remaining operational.