Can I use data grid over WAN? - gridgain

I am new to GridGain and I am currently using GridGain 6.20 (the open source version). I am wondering if I can use the data grid over a WAN (through the Internet, across multiple data centers).
For example, I have four nodes: N1, N2, N3 and N4. N1 and N2 are in the San Jose data center; N3 and N4 are in the San Francisco data center. I plan to build a data grid (a cluster) including N1, N2, N3 and N4. As you can see, it spans the WAN.
Thanks,
Bill

There are several ways of doing this:
1. Simply open the ports over the WAN. GridGain will work, but this is probably not ideal from a security standpoint.
2. Open a VPN connection between the data centers and use a standard GridGain configuration (a configuration sketch follows below).
3. If you are using multiple data centers for fail-over purposes, then you can use the GridGain Data Center Replication feature - http://doc.gridgain.org/latest/Data+Center+Replication
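For options 1 and 2, the key point is that discovery must be able to reach every node across the WAN or VPN link. Below is a rough sketch of a static-IP discovery setup; the class and package names, the port, and the hostnames are assumptions based on GridGain 6.x's TCP discovery SPI, so verify them against your distribution's javadoc before using it.

import java.util.Arrays;

import org.gridgain.grid.GridConfiguration;
import org.gridgain.grid.GridGain;
import org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi;
import org.gridgain.grid.spi.discovery.tcp.ipfinder.vm.GridTcpDiscoveryVmIpFinder;

public class WanNodeStartup {
    public static void main(String[] args) throws Exception {
        // List the WAN-reachable (or VPN) addresses of all four nodes:
        // two in San Jose and two in San Francisco. Hostnames and the
        // discovery port are placeholders.
        GridTcpDiscoveryVmIpFinder ipFinder = new GridTcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Arrays.asList(
            "sj-node1.example.com:47500", "sj-node2.example.com:47500",
            "sf-node1.example.com:47500", "sf-node2.example.com:47500"));

        GridTcpDiscoverySpi discoSpi = new GridTcpDiscoverySpi();
        discoSpi.setIpFinder(ipFinder);

        GridConfiguration cfg = new GridConfiguration();
        cfg.setDiscoverySpi(discoSpi);

        // The node joins the cluster spanning both data centers.
        GridGain.start(cfg);
    }
}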

Related

How to store data on servers close to the user's location in Cassandra?

I'm currently thinking about the best way to handle customers all over the world with Cassandra. Assume I have servers in America and Europe. Is there a mechanism to influence which servers the data is stored on? For a user in the US the data should be hosted on the American servers, with only one safe copy in Europe. In general I was thinking of UUIDs that encode a location: for example, if the last bit is set, the data should be on a server in the US, and otherwise in Europe. Then I was thinking of writing a custom Partitioner which maps the key to a token within the range of the American servers; for example, tokens starting with 00-7f are in Europe and 80-ff are in America. So I could use a normal Murmur3 hash and set the first bit based on the location information in the UUID. Can I influence the partition range of a certain server? Especially with virtual nodes this might get complicated, I think. Is there a way to achieve server-location-based partition selection like this? How would you try to solve this problem?
Right now, you'd need to have a different keyspace for each region. Once that's done, you can set the replication strategy to NetworkTopologyStrategy and use it to set per-data-center replication factors that match your expectation of data locality.
There's an open issue (CASSANDRA-7306) which proposes adding the ability to control locality in the way you describe. Right now there's no indication that it's close to working, or that it'll be implemented in the near future, so the de facto way to implement this is with NetworkTopologyStrategy and an appropriately configured replication factor.
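A minimal sketch of the per-region keyspace approach, assuming the DataStax Java driver 2.x; the data center names 'US' and 'EU' and the keyspace names are illustrative and must match whatever your snitch reports.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class RegionalKeyspaces {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // US users: full copy in the US DC, one backup replica in the EU DC.
        session.execute("CREATE KEYSPACE IF NOT EXISTS us_users WITH replication = "
            + "{'class': 'NetworkTopologyStrategy', 'US': 2, 'EU': 1}");

        // EU users: mirror image of the above.
        session.execute("CREATE KEYSPACE IF NOT EXISTS eu_users WITH replication = "
            + "{'class': 'NetworkTopologyStrategy', 'EU': 2, 'US': 1}");

        cluster.close();
    }
}

Each application instance would then read and write the keyspace for its own region, so the bulk of a user's data stays local while one backup replica lives on the other continent.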

Configuring specific backup nodes for a Hazelcast Distributed Map?

Imagine that an organization has two data centers (named 'A' and 'B' for simplicity) with multiple nodes running in each, and there is a Hazelcast cluster over all of these nodes. Assume that there is a Distributed Map in this cluster, configured with a backup-count of 1.
Is there a way to configure the Hazelcast Distributed Map so that entries owned by nodes in data center A are backed up on nodes in data center B, and vice versa? That way, if a single data center is lost, the Map data (and its backup) is not lost.
What you want is called Partition Grouping. See the documentation for details.
The simplest thing you can do is to include this snippet in your Hazelcast configuration:
<partition-group enabled="true" group-type="CUSTOM">
    <member-group>
        <interface>10.10.1.*</interface> <!-- network in data centre A -->
    </member-group>
    <member-group>
        <interface>10.10.2.*</interface> <!-- network in data centre B -->
    </member-group>
</partition-group>
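If you configure Hazelcast programmatically rather than via XML, the equivalent of the snippet above would look roughly like this; it is a sketch assuming the Hazelcast 3.x PartitionGroupConfig/MemberGroupConfig API.

import com.hazelcast.config.Config;
import com.hazelcast.config.MemberGroupConfig;
import com.hazelcast.config.PartitionGroupConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class DataCentreAwareMember {
    public static void main(String[] args) {
        Config config = new Config();

        MemberGroupConfig dcA = new MemberGroupConfig();
        dcA.addInterface("10.10.1.*");   // network in data centre A

        MemberGroupConfig dcB = new MemberGroupConfig();
        dcB.addInterface("10.10.2.*");   // network in data centre B

        PartitionGroupConfig partitionGroup = config.getPartitionGroupConfig();
        partitionGroup.setEnabled(true);
        partitionGroup.setGroupType(PartitionGroupConfig.MemberGroupType.CUSTOM);
        partitionGroup.addMemberGroupConfig(dcA);
        partitionGroup.addMemberGroupConfig(dcB);

        // With two member groups, backup partitions always land in the other group,
        // i.e. in the other data centre.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}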
Another option is to create a separate cluster in each data centre and connect them via WAN replication. This will decrease latencies within each data centre, but it can produce conflicting updates, and then it's up to your MergePolicy to deal with them.

Data partitioning in Cassandra for multiple datacenters with varying data

So far, I've read through data partitioning in Cassandra and found some basic ways of doing things. For example, if you have 6 nodes, with 3 in each of two separate data centers, data replication works as follows:
Cassandra walks the ring until it comes across a node belonging to another data center and places a replica there, repeating the process until every data center has one copy of the data - as per NetworkTopologyStrategy.
So, we have two copies of the entire data set, one in each data center. But what if I wanted to logically split the data into two separate chunks, based on some attribute like business or geographic location (e.g., data for India in the India data center)? We would then have one chunk of data in the data centers of one geographic location, another chunk in another location, and none of them overlapping.
Would that be possible?
And given the application of Cassandra and Big Data in general, would that make sense?
Geographic sharding is certainly possible. You can simply run multiple data centers that aren't connected; then they won't replicate. Alternatively, you can have them replicate but have your India-based app read and write only to your India DC. Whether it makes sense depends on your application.
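Within a single cluster, one way to express the non-overlapping chunks is a keyspace per region that only replicates to its home data center. This is a sketch using the DataStax Java driver; the data center names ('India', 'US'), keyspace names and replication factors are assumptions that must match your topology.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class GeoShardedKeyspaces {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Data for India lives only on replicas in the India data center.
        session.execute("CREATE KEYSPACE IF NOT EXISTS india_data WITH replication = "
            + "{'class': 'NetworkTopologyStrategy', 'India': 3}");

        // Data for the US lives only on replicas in the US data center.
        session.execute("CREATE KEYSPACE IF NOT EXISTS us_data WITH replication = "
            + "{'class': 'NetworkTopologyStrategy', 'US': 3}");

        cluster.close();
    }
}

A query for the other region then has to go explicitly to the other keyspace, which matches the requirement that the chunks never overlap.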

Cassandra rack concept and database structure

I am new to Cassandra and I would like to learn more about Cassandra's racks and structure.
Suppose I have around 70 column families in Cassandra and two AWS instances.
How many Data Centres will be used?
How many nodes will each rack have?
Is it possible to divide a column family in multiple keyspaces?
The intent of making Cassandra aware of logical racks and data centers is to provide additional levels of fault tolerance. The idea (as described in this document, under the "Network Topology Strategy") is that the application should still be able to function if one rack or data center goes dark. Essentially, Cassandra...
places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.
In this way, you can also query your data at LOCAL_QUORUM, in which the quorum ((replication_factor / 2) + 1, rounded down) is computed only from the replicas in the same data center as the coordinator node. This reduces the effects of inter-data-center latency.
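For example, with the DataStax Java driver 2.x a statement can be pinned to LOCAL_QUORUM as sketched below; the keyspace, table and column names are made up for illustration.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LocalQuorumRead {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        // Only replicas in the coordinator's data center count towards the quorum.
        SimpleStatement stmt =
            new SimpleStatement("SELECT * FROM users WHERE user_id = ?", "bill");
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);

        ResultSet rs = session.execute(stmt);
        System.out.println(rs.one());

        cluster.close();
    }
}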
As for your questions:
How many data centers you use is entirely up to you. If you only have two AWS instances, putting them in different logical data centers is possible, but it only makes sense if you are planning to use consistency level ONE. That is, if one instance goes down, your application only needs to find one other replica. But even then, the snitch can only find data on one instance or the other.
Again, you can define the number of nodes that you wish to have for each rack. But as I indicated with #1, if you only have two instances, there isn't much to be gained by splitting them into different data centers or racks.
I do not believe it is possible to divide a column family over multiple keyspaces. But I think I know what you're getting at. Each keyspace will be created on each instance. As you have 2 instances, you will be able to specify a replication factor of 1 or 2. If you had 3 instances, you could set a replication factor of 2, and then if you lost 1 instance you would still have access to all the data. As you only have 2 instances, you need to be able to handle one going dark, so you will want to make sure both instances have a copy of every row (replication factor of 2).
Really, the logical data center/rack structure becomes more useful as the number of nodes in your cluster increases. With only two, there is little to be gained by splitting them with additional logical barriers. For more information, read through the two docs I linked above:
Apache Cassandra 2.0: Data Replication
Apache Cassandra 2.0: Snitches

How to handle QUORUM consistency in 4 node, 2 DC Cassandra cluster

I have a four-node, two-data-center Cassandra 1.1.1 cluster.
My keyspace has RF 2 per data center, giving me a complete copy of the data on each node.
The cluster is for a vendor product, which uses a read/write consistency of QUORUM. With this config I can only handle the loss of one node. How can I tweak it to handle the loss of a data center?
Unless your data centers are in the same physical location, your network overhead is going to be terrible with this configuration. The reason is that QUORUM does not pay attention to data centers when it counts replicas, so you will frequently have to cross data center lines before acknowledging a read or write. Switching to LOCAL_QUORUM would solve the latency issue, but with RF 2 per data center it would effectively take a data center out of service if one of its nodes goes down. However, as long as both nodes in the second DC are up (and your app can handle this properly), you will still be up and running.
Having said that, the general rule of thumb is that 3 nodes is the bare minimum per data center. If you add a node to each data center and switch to LOCAL_QUORUM reads/writes, you can lose one node in each DC and still have that DC operational, or you can lose an entire DC with the other remaining operational.
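To make the arithmetic concrete, here is a small sketch using the quorum formula quoted earlier; it assumes (an assumption on my part) that the replication factor is raised to 3 per data center when the third node is added.

public class QuorumMath {
    // Quorum size for a given number of replicas: floor(replicas / 2) + 1.
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    public static void main(String[] args) {
        // Current cluster: RF 2 per DC, 4 replicas in total.
        System.out.println("QUORUM (RF 4 total)        = " + quorum(4)); // 3 -> only one node may fail
        System.out.println("LOCAL_QUORUM (RF 2 per DC) = " + quorum(2)); // 2 -> needs both local replicas

        // After adding a node per DC and (assumed) raising RF to 3 per DC:
        System.out.println("LOCAL_QUORUM (RF 3 per DC) = " + quorum(3)); // 2 -> one local node may fail
    }
}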