how to make astyanax prefer local DC? - cassandra

I know that Astyanax has options to make it only use the local DC, but according to this link, the client will then fail if the nodes in the local DC go down. I was wondering if there was something similar to this (a configuration setting), where requests would go to nodes in the local DC if the data exists on one of the nodes, and only access cross data center nodes when absolutely necessary.

Not a configuration setting, but you could achieve it using the following workaround. Have two drivers initialized driver_dc1 and driver_dc2 in your setup where each one connects to the nodes of the pertinent data center.
try{
// perform operation using driver_dc1
}catch(ConnectionException e){
// perform operation using driver_dc2
}

Related

Which node will respond to "SELECT * FROM system.local" using the Cassandra Java driver?

I am trying to write some synchronization code for a java app that runs on each of the cassandra servers in our cluster (so each server has 1 cassandra instance + our app). For this I wanted to make a method that will return the 'local' cassandra node, using the java driver.
Every process creates a cqlSession using the local address as contactPoint. The driver will figure out the rest of the cluster from that. But my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. This seems not so, when trying to run the code.
Is there a way in the Java driver to determine which of the x nodes the process its running on?
I tried this code:
public static Node getLocalNode(CqlSession cqlSession) {
Metadata metadata = cqlSession.getMetadata();
Map<UUID, Node> allNodes = metadata.getNodes();
Row row = cqlSession.execute("SELECT host_id FROM system.local").one();
UUID localUUID = row.getUuid("host_id");
Node localNode = null;
for (Node node : allNodes.values()) {
if (node.getHostId().equals(localUUID)) {
localNode = node;
break;
}
}
return localNode;
}
But it seems to return random nodes - which makes sense if it just sends the query to one of the nodes in the cluster. I was hoping to find a way without providing hardcoded configuration to determine what node the app is running on.
my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. This seems not so, when trying to run the code.
Correct. When running a query where token range ownership cannot be determined, a coordinator is "selected." There is a random component to that selection. But it does take things like network distance and resource utilization into account.
I'm going to advise reading the driver documentation on Load Balancing. This does a great job of explaining how the load balancing policies work with the newer drivers (>= 4.10).
In that doc you will find that query routing plans:
are different for each query, in order to balance the load across the cluster;
only contain nodes that are known to be able to process queries, i.e. neither ignored nor down;
favor local nodes over remote ones.
As far as being able to tell which apps are connected to which nodes, try using the execution information returned by the result set. You should be able to get the coordinator's endpoint and hostId that way.
ResultSet rs = session.execute("select host_id from system.local");
Row row = rs.one();
System.out.println(row.getUuid("host_id"));
System.out.println();
System.out.println(rs.getExecutionInfo().getCoordinator());
Output:
9788de64-08ee-4ab6-86a6-fdf387a9e4a2
Node(endPoint=/127.0.0.1:9042, hostId=9788de64-08ee-4ab6-86a6-fdf387a9e4a2, hashCode=2625653a)
You are correct. The Java driver connects to random nodes by design.
The Cassandra drivers (including the Java driver) are configured with a load-balancing policy (LBP) which determine which nodes the driver contacts and in which order when it runs a query against the cluster.
In your case, you didn't configure a load-balancing policy so it defaults to the DefaultLoadBalancingPolicy. The default policy calculates a query plan (list of nodes to contact) for every single query so each plan is different across queries.
The default policy gets a list of available nodes (down or unresponsive nodes are not included in the query plan) that will "prioritise" query replicas (replicas which own the data) in the local DC over non-replicas meaning replicas will be contacted as coordinators before other nodes. If there are 2 or more replicas available, they are ordered based on "healthiest" first. Also, the list in the query plan are shuffled around for randomness so the driver avoids contacting the same node(s) all the time.
Hopefully this clarifies why your app doesn't always hit the "local" node. For more details on how it works, see Load balancing with the Java driver.
I gather from your post that you want to circumvent the built-in load-balancing behaviour of the driver. It seems like you have a very edge case that I haven't come across and I'm not sure what outcome you're after. If you tell us what problem you are trying to solve, we might be able to provide a better answer. Cheers!

Sharing hazelcast cache between multiple application and using write behind and read through

Question - Can I share the same hazelcast cluster (cache) between the multiple application while using the write behind and read through functionality using map store and map loaders
Details
I have enterprise environment have the multiple application and want to use the single cache
I have multiple application(microservices) ie. APP_A, APP_B and APP_C independent of each other.
I am running once instance of each application and each node will be the member node of the cluster.
APP_A has MAP_A, APP_B has MAP_B and APP_C has MAP_C. Each application has MapStore for their respective maps.
If a client sends a command instance.getMap("MAP_A").put("Key","Value") . This has some inconsistent behavior. Some time I see data is persistent in database but some times not.
Note - I wan to use the same hazelcast instance across all application, so that app A and access data from app B and vice versa.
I am assuming this is due to the node who handles the request. If request is handle by node A then it will work fine, but fails if request is handled by node B or C. I am assuming this is due to Mapstore_A implementation is not available with node B and C.
Am I doing something wrong? Is there something we can do to overcome this issue?
Thanks in advance.
Hazelcast is a clustered solution. If you have multiple nodes in the cluster, the data in each may get moved from place to place when data rebalancing occurs.
As a consequence of this, map store and map loader operations can occur from any node.
So all nodes in the cluster need the same ability to connect to the database.

Need more insight into Hazelcast Client and the ideal scenario to use it

There is already a question on the difference between Hazelcast Instance and Hazelcast client.
And it is mentioned that
HazelcastInstance = HazelcastClient + AnotherFeatures
So is it right to say client just reads and writes to the cluster formed without getting involved in the cluster? i.e. client does not store the data?
This is important to know since we can configure JVM memory as per the usage. The instances forming the cluster will be allocated more than the ones that are just connecting as a client.
It is a little bit more complicated than that. The Hazelcast Lite Member is a full-blown cluster member, without getting partitions assigned. That said, it doesn't store any data but otherwise behaves like a normal member.
Clients on the other side are simple proxies that have to forward everything to one cluster member to get any operation done. You can imagine a Hazelcast client to be something like a JDBC client, that has just enough code to connect to the cluster and redirect requests / retrieve responses.

Datastax java-driver LoadBalancingPolicy

I would like understand how to decide the load balancing policy for a heavy batch workload in cassandra using datastax java driver. I have two datacenters and I would like write into cluster with consistency ONE as quick as possible reliably to the extent possible.
How do I go about choosing the load balancing options, I see TokenAwarePolicy, LatencyAware, DCAware . Can I just use all of them?
Thanks
Srivatsan
The default LoadBalancingPolicy in the java-driver should be perfect for this scenario. The default LoadBalancingPolicy is defined as (from Policies):
public static LoadBalancingPolicy defaultLoadBalancingPolicy() {
return new TokenAwarePolicy(new DCAwareRoundRobinPolicy());
}
This will keep all requests local to the datacenter that the contact points you provide are in and will direct your requests to replicas (using round robin to balance) that have the data you are reading/inserting.
You can nest LoadBalancingPolicies, so if you would like to use all three of these policies you can simply do:
LoadBalancingPolicy policy = LatencyAwarePolicy
.builder(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
.build();
If you are willing to use consistency level ONE, you do not care which data centre is used, so there is no need to use DCAwareRoundRobinPolicy. If you want the write to be as quick as possible, you want to minimise the latency, so you ought to use LatencyAwarePolicy; in practice this will normally select a node at the local data centre, but will use a remote node if it is likely to provide better performance, such as when a local node is overloaded. You also want to minimize the number of network hops, so you want to use one of the storage nodes for the write as the coordinator for the write, so you should use TokenAwarePolicy. You can chain policies together by passing one to the constructor call of another.
Unfortunately, the Cassandra driver does not provide any directly useful base policy for you to use as the child policy of LatencyAwarePolicy or TokenAwarePolicy; the choices are DCAwareRoundRobinPolicy, RoundRobinPolicy and WhiteListPolicy. However, if you use RoundRobinPolicy as the child policy, the LatencyAwarePolicy should, after the first few queries, acquire the latency information it needs.

Which Cassandra node should I connect to?

I might be misunderstanding something here, as it's not clear to me how I should connect to a Cassandra cluster. I have a Cassandra 1.2.1 cluster of 5 nodes managed by Priam, on AWS. i would like to use Astyanax to connect to this cluster by using a code similar to the code bellow:
conPool = new ConnectionPoolConfigurationImpl(getConecPoolName()) .setMaxConnsPerHost(CONNECTION_POOL_SIZE_PER_HOST).setSeeds(MY_IP_SEEDS)
.setMaxOperationsPerConnection(100) // 10000
What should I use as MY_IP_SEEDS? Should I use the IPs of all my nodes split by comma? Or should I use the IP of just 1 machine (the seed machine)? If I use the ip of just one machine, I am worried about overloading this machine with too many requests.
I know Priam has the "get_seeds" REST api (https://github.com/Netflix/Priam/wiki/REST-API) that for each node returns a list of IPs and I also know there is one seed per RAC. However, I am not sure what would happen if the seed node gets down... I would need to connect to others when trying to make new connections, right?
Seed nodes are only for finding the way into the cluster on node startup - no overload problems.
Of course one of the nodes must be reachable and up in the cluster to get the new one up and running.
So the best way is to update the seed list from Priam before starting the node. Priam should be behind an automatically updated DNS entry.
If you're highest availability you should regularly store the current list of seeds from Priam and store them in a mirrored fashion just as you store your puppet or chef config to be able to get nodes up even when Priam isn't reachable.

Resources