I am new to Cassandra. I have configured a multi-node Cassandra cluster with one node per machine.
I want client applications to access the Cassandra database through a single name (like the SCAN name in Oracle RAC). Is that possible?
Our developers are used to the Oracle RAC SCAN name and expect something similar for multi-node Cassandra as well.
At the moment I can access the individual nodes using the separate IPs I have assigned to them.
Can anyone help me with this?
What I think you're describing is service discovery. You should have a look at Consul (https://www.consul.io/) to define the contact points your application uses to connect to Cassandra.
Essentially, you can use Consul to manage a single DNS entry (https://www.consul.io/docs/agent/dns.html) for your applications to use without having to hardcode an IP etc.
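To make the single-name idea concrete, here is a minimal Java sketch that resolves one DNS name to all of its addresses using only the standard library. The class name ContactPointResolver is made up for illustration, and "cassandra.service.consul" in the comment is a hypothetical name, not something from the answer above. The port 9042 is Cassandra's default native transport port. With the DataStax Java driver, a list like this could plausibly be passed to Cluster.builder().addContactPointsWithPorts(...), but treat that wiring as an assumption to check against your driver version.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

class ContactPointResolver {
    // Resolves one name to every address it maps to and pairs each address with the port.
    static List<InetSocketAddress> resolve(String name, int port) {
        try {
            List<InetSocketAddress> contactPoints = new ArrayList<>();
            for (InetAddress address : InetAddress.getAllByName(name)) {
                contactPoints.add(new InetSocketAddress(address, port));
            }
            return contactPoints;
        } catch (UnknownHostException e) {
            // Wrapped so callers do not have to declare a checked exception.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass your own service name (e.g. a hypothetical "cassandra.service.consul") as the first argument.
        String name = args.length > 0 ? args[0] : "localhost";
        for (InetSocketAddress contactPoint : resolve(name, 9042)) {
            System.out.println(contactPoint);
        }
    }
}
```

The application then only needs to know the one name; which nodes sit behind it is managed in DNS.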
When I create a Cassandra Cluster builder in Java, I provide a list of multiple Cassandra nodes, as shown below:
Cluster cluster = Cluster.builder().addContactPoints(host1, host2, host3, host4).build();
But from what I understand, the connector connects only to the first host in the list that is available, and that host becomes my connection point to the Cassandra cluster.
Now, my question is: if my Java application reads and writes a huge amount of data from/to Cassandra, won't it overwhelm the node that it is connected to?
Is there a way to configure my connection such that it uses multiple nodes of Cassandra for its reads/writes? What is the common practice?
It uses the contact point to find the rest of the nodes in the cluster, then creates a pool of connections to all the hosts and balances the requests among them. It doesn't only connect to the hosts you provide unless you use the whitelist load balancing policy or a custom one.
If you're worried about overwhelming nodes, use the RoundRobinPolicy (or the DCAwareRoundRobinPolicy if you have multiple DCs) and it will distribute requests evenly among all of them. If you have hot spots of data and use the TokenAwarePolicy, the load may be uneven, but you shouldn't need to worry about it.
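As a rough illustration of what round-robin balancing means, here is a small stand-alone Java sketch. It is not the driver's implementation; the class name RoundRobinSketch and the host names are made up, and a real policy also has to deal with hosts going up and down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class RoundRobinSketch {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSketch(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        this.hosts = new ArrayList<>(hosts); // defensive copy
    }

    // Returns the host that would coordinate the next request.
    String nextHost() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), hosts.size());
        return hosts.get(index);
    }

    public static void main(String[] args) {
        RoundRobinSketch balancer = new RoundRobinSketch(Arrays.asList("host1", "host2", "host3"));
        for (int i = 0; i < 6; i++) {
            System.out.println("request " + i + " -> " + balancer.nextHost());
        }
    }
}
```

Each host ends up coordinating roughly the same number of requests, which is why a single contact point does not become a bottleneck.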
The DataStax Java driver needs to have access to all the servers in the cluster, even in a multi-DC setup. This seems to be a problem when we want to localise the queries. Is there a way to do that?
It depends on the LoadBalancingPolicy that you use. By default the driver uses a DCAwareRoundRobinPolicy with Token awareness and chooses to connect to no hosts in remote datacenters, but chooses which datacenter is local based on your contact points.
You can configure a DCAwareRoundRobinPolicy to explicitly specify the local datacenter and that you want to connect to 0 hosts in remote DCs, i.e.:
Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1")
    .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("mydc", 0)))
    .build();
Note that the wrapping TokenAwarePolicy is not required, but it is nice to have: it makes the driver choose coordinators that own the data you are storing or querying.
I've been using Couchbase for my database solution and so far it looks very good.
I'm confused however with connecting to a Cluster. A Cluster is just a group of nodes so when you use the API to connect to a Cluster what do you use as the IP? Do you just use one of the nodes in the Cluster? Does it matter which one?
I'm personally using the Node.js API.
Technically all you need is just one node in the list. As soon as the client connects to that one, it gets the cluster map of the entire cluster and knows about all the other nodes. No, it does not matter which node.
That being said, best practice is to have at least 3 nodes of the cluster listed in the connection string or better yet if the SDK you are using supports it, use a DNS SRV record with at least 3 nodes in there. With three nodes in the list if for some reason (e.g. server failure or maintenance) one of the nodes is unavailable, you can still bootstrap an application server to get that cluster map with one of the other nodes in the list.
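The reason for listing several nodes can be sketched in a few lines of Java. This is only an illustration of the bootstrap idea, not SDK code: the class name BootstrapSketch and the reachability check are made up, and a real client would attempt an actual connection instead of using a hard-coded predicate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

class BootstrapSketch {
    // Returns the first node in the list that passes the reachability check, if any.
    static Optional<String> firstReachable(List<String> bootstrapNodes, Predicate<String> isReachable) {
        return bootstrapNodes.stream().filter(isReachable).findFirst();
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("server1", "server2", "server3");
        // Pretend server1 is down for maintenance.
        Predicate<String> isReachable = node -> !node.equals("server1");
        // prints "server2"
        System.out.println(firstReachable(nodes, isReachable).orElse("no bootstrap node reachable"));
    }
}
```

With only server1 in the list, the same outage would leave a newly started application with nothing to bootstrap from.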
I asked this question a few months ago on the Couchbase forums, and the author of the Node.js module answered that you should list "some" of the nodes,
like this:
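var cluster = new couchbase.Cluster("couchbase://server1,server2,server3");
var bucket = cluster.openBucket("default", function(err) {}); // "default" is a placeholder bucket name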
If server4 and server5 are added later, the client will pick them up automatically as soon as they are available in the cluster.
See here for details: https://forums.couchbase.com/t/couchnode-connection-to-cluster/6281
Is there a possibility to write to a particular node using datastax driver?
For example, I have three nodes in datacenter 1 and three nodes in datacenter 2.
Existing
If I build the cluster with any one of them as a seed, all the nodes are detected by the DataStax Java driver. So in this case, if I insert data using the driver, it automatically chooses one of the nodes as the coordinator (preferably in the local datacenter).
Requirement
I want a way to contact any node in datacenter 2 and hand over the coordinator job to one of the nodes in datacenter 2.
Why I need this
I am trying to use the trigger functionality from datacenter 2 alone. Since triggers are handled by the coordinator, I want a coordinator to be selected from datacenter 2 so that datacenter 1 doesn't have to do this work.
You may be able to use the DCAwareRoundRobinPolicy load balancing policy to achieve this by creating the policy such that DC2 is considered the "local" DC.
Cluster.Builder builder = Cluster.builder().withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("dc2"));
In the above example, remote (non-DC2) nodes will be ignored.
There is also a new WhiteListPolicy in driver version 2.0.2 that wraps another load balancing policy and restricts the nodes to a specific list you provide.
Cluster.Builder builder = Cluster.builder().withLoadBalancingPolicy(new WhiteListPolicy(new DCAwareRoundRobinPolicy("dc2"), whiteList));
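The whiteList argument above is a collection of node addresses. A minimal sketch of building one with the standard library follows. The class name WhiteListSketch, the helper buildWhiteList, and the 10.0.2.x addresses are all hypothetical, and 9042 assumes the default native transport port.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

class WhiteListSketch {
    // Pairs each host (IP literal or resolvable name) with the given port.
    static List<InetSocketAddress> buildWhiteList(int port, String... hosts) {
        List<InetSocketAddress> whiteList = new ArrayList<>();
        for (String host : hosts) {
            whiteList.add(new InetSocketAddress(host, port));
        }
        return whiteList;
    }

    public static void main(String[] args) {
        // Hypothetical addresses of the DC2 nodes that may act as coordinators.
        List<InetSocketAddress> whiteList = buildWhiteList(9042, "10.0.2.1", "10.0.2.2", "10.0.2.3");
        System.out.println(whiteList);
    }
}
```

Note that nodes added to DC2 later would not be used until they are also added to this list.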
For multi-DC scenarios Cassandra provides EACH and LOCAL consistency levels (e.g. EACH_QUORUM and LOCAL_QUORUM): EACH requires a successful acknowledgement from every DC, while LOCAL requires one only from the local DC.
If I understood correctly, what you are trying to achieve is DC failover in your application. This is not good practice. Let's assume your application is hosted in DC1 alongside Cassandra. If DC1 goes down, your entire application is unavailable. If DC2 goes down, your application can still write with a LOCAL CL, and C* will replicate the changes when DC2 is back.
If you want to achieve HA, you need to deploy the application in each DC, use CL=LOCAL_X, and do failover at the DNS level (e.g. using AWS Route53).
See data consistency docs and this blog post for more info about consistency levels for multiple DCs.
I'm using the Cassandra CQL/JDBC driver I got from Google Code, but it doesn't seem to let me provide a cluster name. Is there a way?
I'm using cluster names to ensure I don't run commands against a live system, which has a different cluster name from my dev systems.
Edit: Just to clarify, I have two totally separate Cassandra clusters, one live and one for test. They have different cluster names to ensure that I don't accidentally run test code meant for the test cluster on the live cluster. Therefore any client I need to use must let me set a cluster name. Hector does this.
There is no inbuilt protection for checking cluster names for Cassandra clients. It is built to ensure nodes from different clusters don't try and join together but not to ensure clients connect to the right cluster. It would be possible to add this checking to a client though (since the cluster name is exposed to the client) but I'm not aware of any clients doing this.
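Such a client-side check could look roughly like the Java sketch below. The class ClusterNameGuard and the cluster names are made up for illustration. How the actual name is obtained depends on the client: with the DataStax Java driver it should be available from cluster.getMetadata().getClusterName(), and it can also be read with the CQL query SELECT cluster_name FROM system.local. Both are assumptions about your setup, not something the JDBC driver from the question is known to offer.

```java
class ClusterNameGuard {
    // Throws if the cluster we are connected to is not the one we expected.
    static void requireClusterName(String expected, String actual) {
        if (!expected.equals(actual)) {
            throw new IllegalStateException(
                "Connected to cluster '" + actual + "' but expected '" + expected + "'");
        }
    }

    public static void main(String[] args) {
        // Hypothetical names; in real code 'actual' would be read from the cluster.
        requireClusterName("dev-cluster", "dev-cluster"); // passes silently
        try {
            requireClusterName("dev-cluster", "live-cluster");
        } catch (IllegalStateException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```

Running this check right after connecting, before any statements are executed, gives the same kind of protection the question describes for Hector.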
I'd strongly recommend firewalling off your different environments to avoid this kind of mistake. If that isn't possible, you should choose different ports to avoid confusion. Change this with the 'rpc_port' setting in cassandra.yaml.
You'd have to mirror the data on two different clusters. You can't access the same cluster under different names.
To rename your cluster (from the default 'Test Cluster'), edit the cluster_name setting in the Cassandra configuration file at location/of/cassandra/conf/cassandra.yaml. It's the top line; if you need more details, look at the DataStax configuration documentation and explanation.