I would like to understand how to decide on the load balancing policy for a heavy batch workload in Cassandra using the DataStax Java driver. I have two datacenters, and I would like to write into the cluster with consistency ONE, as quickly as possible while staying as reliable as possible.
How do I go about choosing the load balancing options? I see TokenAwarePolicy, LatencyAwarePolicy, and DCAwareRoundRobinPolicy. Can I just use all of them?
Thanks
Srivatsan
The default LoadBalancingPolicy in the java-driver should be perfect for this scenario. It is defined as follows (from Policies):
public static LoadBalancingPolicy defaultLoadBalancingPolicy() {
    return new TokenAwarePolicy(new DCAwareRoundRobinPolicy());
}
This will keep all requests local to the datacenter that the contact points you provide are in and will direct your requests to replicas (using round robin to balance) that have the data you are reading/inserting.
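If you prefer to name the local datacenter explicitly rather than let the driver infer it from your contact points, a minimal sketch (driver 3.x API; "DC1" is a placeholder for your local datacenter's name):

// Token-aware routing restricted to an explicitly named local datacenter.
// "DC1" is a placeholder; use the datacenter name your cluster actually reports.
LoadBalancingPolicy policy = new TokenAwarePolicy(
        DCAwareRoundRobinPolicy.builder()
                .withLocalDc("DC1")
                .build());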
You can nest LoadBalancingPolicies, so if you would like to use all three of these policies you can simply do:
LoadBalancingPolicy policy = LatencyAwarePolicy
        .builder(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
        .build();
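To actually use the combined policy, pass it to the cluster builder. A minimal sketch (driver 3.x; the contact point address is a placeholder):

Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")        // placeholder contact point
        .withLoadBalancingPolicy(policy)    // the nested policy built above
        .build();
Session session = cluster.connect();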
If you are willing to use consistency level ONE, you do not care which data centre is used, so there is no need for DCAwareRoundRobinPolicy. If you want the write to be as quick as possible, you want to minimise latency, so you should use LatencyAwarePolicy; in practice this will normally select a node in the local data centre, but it will use a remote node if that is likely to provide better performance, such as when a local node is overloaded. You also want to minimise the number of network hops, so the coordinator for the write should be one of the nodes that stores the data, which is what TokenAwarePolicy gives you. You can chain policies together by passing one to the constructor call of another.
Unfortunately, the Cassandra driver does not provide any directly useful base policy for you to use as the child policy of LatencyAwarePolicy or TokenAwarePolicy; the choices are DCAwareRoundRobinPolicy, RoundRobinPolicy and WhiteListPolicy. However, if you use RoundRobinPolicy as the child policy, the LatencyAwarePolicy should, after the first few queries, acquire the latency information it needs.
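As a sketch of that chain (driver 3.x API), with RoundRobinPolicy as the innermost child policy so that LatencyAwarePolicy and TokenAwarePolicy do the actual selection:

LoadBalancingPolicy policy = LatencyAwarePolicy
        .builder(new TokenAwarePolicy(new RoundRobinPolicy()))
        .build();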
Related
I am trying to write some synchronization code for a java app that runs on each of the cassandra servers in our cluster (so each server has 1 cassandra instance + our app). For this I wanted to make a method that will return the 'local' cassandra node, using the java driver.
Every process creates a CqlSession using the local address as the contact point. The driver will figure out the rest of the cluster from that. But my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. This seems not so when trying to run the code.
Is there a way in the Java driver to determine which of the nodes the process is running on?
I tried this code:
public static Node getLocalNode(CqlSession cqlSession) {
    Metadata metadata = cqlSession.getMetadata();
    Map<UUID, Node> allNodes = metadata.getNodes();
    Row row = cqlSession.execute("SELECT host_id FROM system.local").one();
    UUID localUUID = row.getUuid("host_id");
    Node localNode = null;
    for (Node node : allNodes.values()) {
        if (node.getHostId().equals(localUUID)) {
            localNode = node;
            break;
        }
    }
    return localNode;
}
But it seems to return random nodes - which makes sense if the driver just sends the query to any one of the nodes in the cluster. I was hoping to find a way, without hardcoded configuration, to determine which node the app is running on.
my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. This seems not so, when trying to run the code.
Correct. When running a query where token range ownership cannot be determined, a coordinator is "selected." There is a random component to that selection. But it does take things like network distance and resource utilization into account.
I'm going to advise reading the driver documentation on Load Balancing. This does a great job of explaining how the load balancing policies work with the newer drivers (>= 4.10).
In that doc you will find that query routing plans:
are different for each query, in order to balance the load across the cluster;
only contain nodes that are known to be able to process queries, i.e. neither ignored nor down;
favor local nodes over remote ones.
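The 4.x drivers no longer have you assemble policy objects yourself; the built-in DefaultLoadBalancingPolicy is used and you supply the local datacenter. As a minimal sketch (the contact point address and datacenter name are placeholders):

CqlSession session = CqlSession.builder()
        .addContactPoint(new InetSocketAddress("10.0.0.1", 9042))  // placeholder contact point
        .withLocalDatacenter("dc1")                                // placeholder datacenter name
        .build();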
As far as being able to tell which apps are connected to which nodes, try using the execution information returned by the result set. You should be able to get the coordinator's endpoint and hostId that way.
ResultSet rs = session.execute("select host_id from system.local");
Row row = rs.one();
System.out.println(row.getUuid("host_id"));
System.out.println();
System.out.println(rs.getExecutionInfo().getCoordinator());
Output:
9788de64-08ee-4ab6-86a6-fdf387a9e4a2
Node(endPoint=/127.0.0.1:9042, hostId=9788de64-08ee-4ab6-86a6-fdf387a9e4a2, hashCode=2625653a)
You are correct. The Java driver connects to random nodes by design.
The Cassandra drivers (including the Java driver) are configured with a load-balancing policy (LBP) which determines which nodes the driver contacts, and in which order, when it runs a query against the cluster.
In your case, you didn't configure a load-balancing policy, so it defaults to the DefaultLoadBalancingPolicy. The default policy calculates a query plan (list of nodes to contact) for every single query, so each plan is different across queries.
The default policy builds the plan from the available nodes (down or unresponsive nodes are excluded) and prioritises replicas (nodes which own the data) in the local DC over non-replicas, meaning replicas will be contacted as coordinators before other nodes. If two or more replicas are available, they are ordered "healthiest" first. The nodes in the query plan are also shuffled for randomness so the driver avoids contacting the same node(s) all the time.
Hopefully this clarifies why your app doesn't always hit the "local" node. For more details on how it works, see Load balancing with the Java driver.
I gather from your post that you want to circumvent the built-in load-balancing behaviour of the driver. It seems like you have an edge case that I haven't come across, and I'm not sure what outcome you're after. If you tell us what problem you are trying to solve, we might be able to provide a better answer. Cheers!
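If the underlying goal is to force particular queries onto a particular node rather than letting the policy choose, one possible approach (a sketch, assuming the 4.x driver, where a statement can carry an explicit target node via setNode) is:

// Pick a specific Node from the driver metadata (how you choose it is up to you),
// then route the statement to it explicitly, bypassing the load-balancing policy.
Node target = session.getMetadata().getNodes().values().iterator().next();
SimpleStatement stmt = SimpleStatement.newInstance("SELECT host_id FROM system.local")
        .setNode(target);
Row row = session.execute(stmt).one();
System.out.println(row.getUuid("host_id"));   // should match target.getHostId()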
When using the Cassandra DataStax Java driver, when would I use multiple sessions under the same cluster? I am not able to find any good use case for having one cluster and multiple sessions.
My application has multiple components/modules that access Cassandra. Based on the answer, I will decide whether to have one session per component/module or just one session shared across all the components of my application.
Update: Everywhere on the internet the recommendation is to use one session. I get that, but my question is: in what scenario would you create multiple sessions for one cluster? If there is no such scenario, why does the library allow creating multiple sessions at all, instead of just exposing a method that returns a singleton session object?
Use just one Session across all your components.
In Cassandra the Session is a heavyweight object. It is thread-safe, and it maintains multiple connections, cached prepared statements, etc.
Here is the JavaDoc:
A session holds connections to a Cassandra cluster, allowing it to be queried. Each session maintains multiple connections to the cluster nodes, provides policies to choose which node to use for each query (round-robin on all nodes of the cluster by default), and handles retries for failed query (when it makes sense), etc...
Session instances are thread-safe and usually a single instance is enough per application. As a given session can only be "logged" into one keyspace at a time (where the "logged" keyspace is the one used by query if the query doesn't explicitly use a fully qualified table name), it can make sense to create one session per keyspace used. This is however not necessary to query multiple keyspaces since it is always possible to use a single session with fully qualified table name in queries.
Source :
https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/Session.html
https://ahappyknockoutmouse.wordpress.com/2014/11/12/246/
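As a small sketch of the single-session approach the JavaDoc describes (driver 3.x style; the keyspace and table names are made up for illustration):

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect();   // one shared, thread-safe Session, no "logged" keyspace

// Each component/module uses the same Session and fully qualified table names.
session.execute("SELECT * FROM orders_ks.orders LIMIT 10");
session.execute("SELECT * FROM users_ks.users LIMIT 10");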
I am running an ExecutorService of more than 50 threads concurrently. Each thread is opening a connection to Cassandra and performing inserts using springframework.data.cassandra. The problem is when I open more than 50 connections at a time, I get the following error.
Caused by: org.jboss.netty.channel.ChannelException: Failed to create a selector.
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.openSelector(AbstractNioSelector.java:343)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.<init>(AbstractNioSelector.java:100)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.<init>(AbstractNioWorker.java:52)
at org.jboss.netty.channel.socket.nio.NioWorker.<init>(NioWorker.java:45)
at org.jboss.netty.channel.socket.nio.NioWorkerPool.createWorker(NioWorkerPool.java:45)
at org.jboss.netty.channel.socket.nio.NioWorkerPool.createWorker(NioWorkerPool.java:28)
at org.jboss.netty.channel.socket.nio.AbstractNioWorkerPool.newWorker(AbstractNioWorkerPool.java:143)
at org.jboss.netty.channel.socket.nio.AbstractNioWorkerPool.init(AbstractNioWorkerPool.java:81)
at org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(NioWorkerPool.java:39)
at org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(NioWorkerPool.java:33)
at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.<init>(NioClientSocketChannelFactory.java:151)
at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.<init>(NioClientSocketChannelFactory.java:116)
at com.datastax.driver.core.Connection$Factory.<init>(Connection.java:532)
at com.datastax.driver.core.Cluster$Manager.<init>(Cluster.java:1201)
at com.datastax.driver.core.Cluster$Manager.<init>(Cluster.java:1144)
at com.datastax.driver.core.Cluster.<init>(Cluster.java:121)
at com.datastax.driver.core.Cluster.<init>(Cluster.java:108)
at com.datastax.driver.core.Cluster.buildFrom(Cluster.java:177)
at com.datastax.driver.core.Cluster$Builder.build(Cluster.java:1109)
If I open exactly 50 threads (or fewer), it works fine. Is there a way to configure this so I can allow more? In my cassandra.yaml file, the comments for rpc_max_threads say the default is unlimited.
My guess is you are overwhelming your OS by creating too many connections. You should only create 1 Cluster instance per Cassandra cluster. Clusters create Sessions, which manage their own connection pools. Both Cluster and Session are thread safe, so you can share them between threads.
Four simple rules for coding with the driver distills these concepts well:
When writing code that uses the driver, there are four simple rules that you should follow that will also make your code efficient:
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
...
A Cluster instance allows you to configure many important aspects of how connections and queries are handled. At this level you can configure everything from the contact points (addresses of the nodes contacted initially, before the driver performs node discovery), to the request routing policy, to the retry and reconnection policies, and so forth. Generally such settings are set once at the application level.
While the Session instance is centered around query execution, it also manages the per-node connection pools. The Session is a long-lived object and should not be used in a request-response, short-lived fashion. Your code should share the same Cluster and Session instances across your application.
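As a sketch of what that looks like with an ExecutorService (driver 2.x/3.x style; the keyspace, table and pool size are placeholders):

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_ks");   // one Session shared by every worker thread

ExecutorService pool = Executors.newFixedThreadPool(50);
for (int i = 0; i < 1000; i++) {
    int id = i;
    // The shared Session is thread-safe; each task reuses its connection pools.
    pool.submit(() -> session.execute("INSERT INTO my_table (id) VALUES (?)", id));
}
pool.shutdown();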
My team is testing the token aware connection pool of Astyanax. How can we measure effectiveness of the connection pool type, i.e. how can we know how the tokens are distributed in a ring and how client connections are distributed across them?
Our initial tests, which count the number of open connections on the network interfaces, show that only 3 out of the 4 or more Cassandra instances in the ring are used, and that the other nodes participate in request processing only to a very limited extent.
What other information would help in making a valid judgment/verification? Is there a Cassandra/Astyanax API, or are there command line tools, to help us out?
Use OpsCenter. This will show you how balanced your cluster is, i.e. whether each node has the same amount of data, as well as being able to graph the incoming read/write requests per node and for your entire cluster. It is free and works with open source Cassandra as well as DSE. http://www.datastax.com/what-we-offer/products-services/datastax-opscenter
I know that Astyanax has options to make it only use the local DC, but according to this link, the client will then fail if the nodes in the local DC go down. I was wondering if there was something similar to this (a configuration setting), where requests would go to nodes in the local DC if the data exists on one of the nodes, and only access cross data center nodes when absolutely necessary.
There is no configuration setting for this, but you could achieve it with the following workaround: initialize two drivers, driver_dc1 and driver_dc2, in your setup, where each one connects to the nodes of its respective data center.
try {
    // perform the operation using driver_dc1 (local data center)
} catch (ConnectionException e) {
    // local DC unavailable: fall back and perform the operation using driver_dc2
}