I saw that the Cassandra client needs an array of hosts.
For example, Python uses this:
from cassandra.cluster import Cluster
cluster = Cluster(['192.168.0.1', '192.168.0.2'])
source: http://datastax.github.io/python-driver/getting_started.html
Question 1: Why do I need to pass these nodes?
Question 2: Do I need to pass all nodes? Or is one sufficient? (All nodes have the information about all other nodes, right?)
Question 3: Does the client choose the best node to connect knowing all nodes? Does the client know what data is stored in each node?
Question 4: I'm starting to use Cassandra for the first time, and I'm using Kubernetes for the first time. I deployed a Cassandra cluster with 3 Cassandra nodes. I deployed another machine, and from that machine I want to connect to Cassandra with the Python Cassandra client. Do I need to pass all the Cassandra IPs to the Python client, or is it sufficient to use the Cassandra DNS name provided by Kubernetes?
For example, when I run a dig command I can see all the Cassandra IPs (10.32.1.19, 10.32.1.24, 10.32.2.24), but I don't know whether it is sufficient to pass this DNS name to the client:
# dig cassandra.default.svc.cluster.local
; <<>> DiG 9.10.3-P4-Debian <<>> cassandra.default.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18340
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;cassandra.default.svc.cluster.local. IN A
;; ANSWER SECTION:
cassandra.default.svc.cluster.local. 30 IN A 10.32.1.19
cassandra.default.svc.cluster.local. 30 IN A 10.32.1.24
cassandra.default.svc.cluster.local. 30 IN A 10.32.2.24
;; Query time: 2 msec
;; SERVER: 10.35.240.10#53(10.35.240.10)
;; WHEN: Thu Apr 04 16:08:06 UTC 2019
;; MSG SIZE rcvd: 125
What are the disadvantages of using, for example:
from cassandra.cluster import Cluster
cluster = Cluster(['cassandra.default.svc.cluster.local'])
Question 1: Why do I need to pass these nodes?
To make the initial contact with the cluster. Once the connection is established, the driver no longer needs these contact points.
Question 2: Do I need to pass all nodes? Or is one sufficient? (All
nodes have the information about all other nodes, right?)
You can pass only one node as a contact point, but the problem is that if that node is down when the driver tries to connect, it won't be able to reach the cluster at all. If you provide additional contact points, the driver will try them even if the first one fails. It is a good idea to use your Cassandra seed list as the contact points.
Question 3: Does the client choose the best node to connect knowing
all nodes? Does the client know what data is stored in each node?
Once the initial connection is made, the client driver fetches the metadata about the cluster. The client then knows which node holds which data and which node can be queried with the lowest latency. You can configure all of this using load balancing policies.
Refer: https://docs.datastax.com/en/developer/python-driver/3.10/api/cassandra/policies/
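For example, here is a minimal sketch of configuring such a policy with the Python driver; the datacenter name 'datacenter1' is a placeholder and the contact points are the ones from the question:

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Route each query to a replica that owns the data (token-aware),
# preferring nodes in the local datacenter (DC-aware).
# 'datacenter1' is a placeholder for your actual datacenter name.
policy = TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='datacenter1'))

cluster = Cluster(
    contact_points=['192.168.0.1', '192.168.0.2'],
    load_balancing_policy=policy,
)
session = cluster.connect()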
Question 4: I'm starting to use Cassandra for the first time, and I'm
using Kubernetes for the first time. I deployed a Cassandra cluster
with 3 Cassandra nodes. I deployed another machine, and from that
machine I want to connect to Cassandra with the Python Cassandra client.
Do I need to pass all the Cassandra IPs to the Python client, or is it
sufficient to use the Cassandra DNS name provided by Kubernetes?
If the hostname can be resolved, it is generally better to use DNS instead of IPs; I don't see any disadvantage.
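If you want to see exactly what the client will work with, here is a small sketch (Python standard library plus the driver) that resolves the Kubernetes service name from the dig output above and passes the resulting IPs as contact points; passing the DNS name directly, as in Cluster(['cassandra.default.svc.cluster.local']), should work as well:

import socket
from cassandra.cluster import Cluster

# Resolve the Kubernetes service name to its A records.
service = 'cassandra.default.svc.cluster.local'
ips = sorted({info[4][0] for info in socket.getaddrinfo(service, 9042)})

# Use the resolved IPs as the initial contact points.
cluster = Cluster(contact_points=ips)
session = cluster.connect()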
Related
In DataStax driver 3.x we have the DCAwareRoundRobinPolicy, which tries to connect to remote nodes if the nodes in the local datacenter fail. In DataStax driver 4.x that policy is gone and the default load balancing policy confines itself to the local datacenter only. But in the DataStax docs it's mentioned that:
Cross-datacenter failover is enabled with the following configuration option:
datastax-java-driver.advanced.load-balancing-policy.dc-failover {
    max-nodes-per-remote-dc = 2
}
The driver will then attempt to open connections to nodes in remote datacenters. But in the driver we specify only a single local datacenter to connect to, as below:
CqlSession session = CqlSession.builder()
        .addContactPoint(new InetSocketAddress("1.2.3.4", 9042))
        .addContactPoint(new InetSocketAddress("5.6.7.8", 9042))
        .withLocalDatacenter("datacenter1")
        .build();
How is the connection to the remote datacenter handled? Please help.
The quick answer is that you don't do it programmatically -- the Java driver does it for you after you've enabled advanced.load-balancing-policy.dc-failover.
The contact points are just the initial hosts the driver "contacts" to discover the cluster topology. After it has collected metadata about the cluster, it knows about nodes in remote DCs.
Since you've configured max-nodes-per-remote-dc = 2, the driver will add 2 nodes from each remote DC to the end of the query plan. The nodes in the local DC will be listed first then followed by the remote nodes. If the driver can't contact the nodes in the local DC, then it will start contacting the remote nodes in the query plan, one node at a time until it runs out of nodes to contact.
But I have to reiterate what I said in your other question: we do not recommend enabling DC-failover. For anyone else coming across this answer, you've been warned. Cheers!
I have 10 Cassandra nodes running on Kubernetes and a single contact point that exposes the service on port 10023.
However, when the DataStax driver tries to establish a connection with the other nodes of the cluster, it uses the exposed port instead of the default one, and I get the following error:
com.datastax.driver.core.ConnectionException: [/10.210.1.53:10023] Pool was closed during initialization
Is there a way to expose a single contact point and have the driver communicate with the other nodes on the standard port (9042)?
I checked the DataStax documentation for anything related to this but didn't find much.
This is how I connect to the cluster:
Cluster.Builder builder = Cluster.builder()
        .addContactPoints(address)
        .withPort(10023)
        .withCredentials(user, password)
        .withMaxSchemaAgreementWaitSeconds(600)
        .withSocketOptions(
                new SocketOptions()
                        .setConnectTimeoutMillis(Integer.valueOf(timeout))
                        .setReadTimeoutMillis(Integer.valueOf(timeout)));

Cluster cluster = builder.withoutJMXReporting().build();
Session session = cluster.connect();
After the driver contacts the first node, it fetches information about the cluster and uses it; this information includes the ports on which Cassandra listens.
To implement what you want, Cassandra itself needs to listen on the corresponding port -- this is configured via the native_transport_port parameter in cassandra.yaml.
Also, by default the Cassandra driver will try to connect to all nodes in the cluster because it uses the DCAware/TokenAware load balancing policy. If you want to use only one node, you need to use WhiteListPolicy instead of the default policy, but that is not optimal from a performance point of view.
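For reference, here is a minimal sketch of such a whitelist setup, shown with the Python driver used earlier in this document (the Java driver's WhiteListPolicy is configured analogously); the address and port are the ones from this question:

from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy

# Only the whitelisted host is used for queries; the driver will not
# try to open connection pools to the rest of the cluster.
cluster = Cluster(
    contact_points=['10.210.1.53'],
    port=10023,
    load_balancing_policy=WhiteListRoundRobinPolicy(['10.210.1.53']),
)
session = cluster.connect()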
I would suggest rethinking how you expose Cassandra to clients.
Let's say I have 4 nodes: host1, host2, host3 and host4. However, I only add host1 and host2 as contact hosts. What would happen if I perform any operation in DevCenter? Will the action propagate to host3 and host4? Will this cause data corruption?
Here's what will happen:
DevCenter will use the whitelist load balancing policy to connect to the provided nodes.
While DevCenter uses the DataStax Java driver as the underlying connector, it uses the above-mentioned load balancing policy to reduce the time needed to obtain connections (instead of the driver's default load balancing policy, which requires discovering all the nodes in the cluster and initiating connection pools to all of them).
DevCenter will send the requests to the nodes in the list you provided.
If the data is local to these nodes, they will handle the requests themselves. If the data lives on other nodes in the cluster, the nodes used for the connection will act as coordinators (basically, they relay the requests to the nodes that hold the data).
Bottom line: there's no risk of data corruption, and the results you get will be exactly the same as if you connected to all the nodes.
I've been looking to find how to configure a client to connect to a Cassandra cluster.
Independent of clients like Pelops, Hector, etc, what is the best way to connect to a multi-node Cassandra cluster?
Passing IP addresses as strings works fine, but what about a growing number of cluster nodes in the future? Does the client have to keep an up-to-date list of ALL the cluster node IPs?
I don't know if this answers all your questions, but cluster growth and the IPs the client knows about are not related.
I have a 5-node cluster, but the client(s) only know 2 IP addresses: the seeds. Since each machine in the cluster knows about the seeds (each cassandra.yaml contains the seeds' IP addresses), if a new machine is added, information about it comes "for free" on the client side.
Imagine a 5-node cluster with the following IPs:
192.168.1.1
192.168.1.2 (seed)
192.168.1.3
192.168.1.4 (seed)
192.168.1.5
e.g., when node .5 boots, it contacts the seeds (nodes .2 and .4) and receives information about the whole cluster. If you add a new node 192.168.1.6, it will behave exactly like .5 and ask the seeds about the cluster state. On the client side you don't have to change anything: you will just know that you now have 6 endpoints instead of 5.
PS: you don't necessarily have to connect to the seeds; you can connect to any node of the cluster, since after contacting the seeds each node knows the whole cluster topology.
PPS: it's your choice how many nodes to put in your "client known hosts"; you can also list all 5, but this doesn't change the fact that if a node is added you don't need to do anything on the client side.
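To see this from the client side, here is a small sketch with the Python driver used earlier in this document: connect with only the two seeds from the example above as contact points, then list every host the driver has discovered from the cluster metadata.

from cassandra.cluster import Cluster

# Contact points are only the seeds; the driver discovers the rest
# of the topology from the cluster metadata.
cluster = Cluster(['192.168.1.2', '192.168.1.4'])
session = cluster.connect()

for host in cluster.metadata.all_hosts():
    print(host.address, host.datacenter, host.is_up)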
Regards,
Carlo
You will have an easier time letting the client track the state of each node. Smart clients track endpoint state via gossip information, which announces new nodes as they appear in the cluster.
Currently I'm configuring a server pool on AWS. It is a simple setup with two database servers, a scalable server array, and two load balancers in front of it all. Every machine has a failover standing by, so it should all be pretty robust.
The load balancers should be able to fail over through round-robin DNS. So in the happy-day scenario both machines get hit and distribute the traffic over the array. When one of these machines is down, round-robin DNS in combination with browser retries should make browsers shift their target host to the machine that is still up once they hit a timeout. This is not something I came up with, but it seems like a very good solution.
The problem I'm experiencing is the following: the shift does happen, but not just once for the failed request -- it happens for each and every subsequent request from the same browser. So a simple page request takes 21 seconds to load, after which every image also takes 21 seconds to load, and all following page requests take just as long. So the failover works but is at the same time completely useless.
Output from a dig:
; <<>> DiG 9.6.1-P2 <<>> example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45224
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;example.com. IN A
;; ANSWER SECTION:
www.example.com. 86400 IN A 1.2.3.4
www.example.com. 86400 IN A 1.2.3.4
;; Query time: 31 msec
;; SERVER: 172.16.0.23#53(172.16.0.23)
;; WHEN: Mon Dec 20 12:21:25 2010
;; MSG SIZE rcvd: 67
Thanks in advance!
Maarten Hoekstra
Kingsquare Information Services
When the DNS server gives a list of IP addresses to the client, this list is ordered (possibly in a rotating manner, i.e. subsequent DNS queries might return the addresses in a different order). It is likely that the browser caches the DNS response, i.e. the list it originally received. It then does not assume that a failed connection means the server is down, but retries the list in the same order every time.
So round-robin DNS is for load balancing at best; it is not very well suited to support fault tolerance.
There is a reason we call this "poor man's load balancing." It does work, but you are at the mercy of the resolver and of timeouts, depending on which IP is returned first by your DNS servers. You can look at something like dnsmadeeasy.com and their DNS failover (there are others that do this, but dnsmadeeasy is the one I know of). Basically, they monitor application availability and can quickly update DNS records in response to application state.