How to leave a Hazelcast cluster gracefully?

Currently, when I remove a node (e.g. ip-2) I simply call HazelcastInstance.shutdown(). But I still end up seeing a lot of warnings in the logs, e.g.
[ip-1]:5701 [xxx] [3.3.3] Removing connection to endpoint Address[ip-2]:5701 Cause => java.net.SocketException {Connection refused to address /ip-2:5701}, Error-Count: 5
[ip-1]:5701 [xxx] [3.3.3] This node does not have a connection to Member [ip-2]:5701
[ip-1]:5701 [xxx] [3.3.3] hz._hzInstance_1_xxx.IO.thread-in-0 Closing socket to endpoint Address[ip-2]:5701, Cause:java.io.EOFException: Remote socket closed!
Is there a more proper way to remove nodes from a cluster?

This (calling shutdown()) is the recommended way. The warnings are just the remaining members noticing the closed connections, so the logging is a bit confusing, but harmless.
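For reference, a minimal sketch of that shutdown path (the lifecycle listener is only an illustrative way to watch the member leave; it is not required):

// A sketch of a graceful member shutdown; the listener just logs the lifecycle
// transitions so you can see the node leave before the JVM exits.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.LifecycleEvent;

public class GracefulLeave {
    public static void main(String[] args) {
        HazelcastInstance instance = Hazelcast.newHazelcastInstance();
        instance.getLifecycleService().addLifecycleListener(
                (LifecycleEvent event) -> System.out.println("Lifecycle: " + event.getState()));
        // ... do work ...
        // Graceful shutdown: the member leaves the cluster before closing its sockets;
        // the remaining members still log the closed connections, which is the
        // "confusing" output shown in the question.
        instance.shutdown();
    }
}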

Related

Why does Cassandra Kerberos Connection to second keyspace fail?

We are trying to connect to two keyspaces of Cassandra (3.x) in the same application with the same Kerberos credentials. The application is able to connect to one keyspace but not the other. Access to the keyspaces has been verified.
Error on connection:
2022-08-22 13:15:10,972 [cluster-reconnection-0] DEBUG c.d.d.c.ControlConnection [--]- [Control connection] error on 169.24.167.109:9042 connection, trying next host
javax.security.auth.login.LoginException: No LoginModules configured for CassandraJavaClient
at javax.security.auth.login.LoginContext.init(LoginContext.java:264)
at javax.security.auth.login.LoginContext.<init>(LoginContext.java:417)
The JAAS configuration (which points at the ticket cache) is:
CassandraJavaClient {
com.sun.security.auth.module.Krb5LoginModule required useTicketCache=true ticketCache="/var//krb5cc_userlogin";
};
The same ticket cache file is used by the first connection, which succeeds, while the second connection fails. I am not even sure how to debug it (I tried remote debugging, but since the initial control connection is an async call, I was unable to get to the actual error).
We are using com.datastax.cassandra:cassandra-driver-core:jar:3.6.0
Any ideas or help to debug or resolve this would be highly appreciated.
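One way to narrow this down (a hedged sketch, not a confirmed fix) is to exercise the CassandraJavaClient JAAS entry directly, outside the driver, since "No LoginModules configured" usually means the JVM never picked up the JAAS file at all; the /etc/cassandra-jaas.conf path below is hypothetical:

// Standalone JAAS sanity check: verifies that the "CassandraJavaClient" login entry
// is visible to the JVM and that the ticket cache can be used for a login.
import javax.security.auth.login.LoginContext;
import javax.security.auth.login.LoginException;

public class JaasCheck {
    public static void main(String[] args) {
        // Hypothetical path; point this at the JAAS file that contains CassandraJavaClient.
        System.setProperty("java.security.auth.login.config", "/etc/cassandra-jaas.conf");
        try {
            LoginContext lc = new LoginContext("CassandraJavaClient");
            lc.login(); // uses the ticket cache, since useTicketCache=true
            System.out.println("JAAS login OK: " + lc.getSubject());
        } catch (LoginException e) {
            e.printStackTrace();
        }
    }
}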

Could not get connection while getPartitionedTopicMetadata - io.netty.channel.ConnectTimeoutException: connection timed out

I have a basic Pulsar app, and when I try to connect to Pulsar, I get this exception:
2021-03-10 14:38:26.107 WARN 7 --- [r-client-io-1-1] o.a.pulsar.client.impl.ConnectionPool : Failed to open connection to my-pulsar-server-ms-tls.domain.com:6651 : io.netty.channel.ConnectTimeoutException: connection timed out: my-pulsar-server-ms-tls.domain.com/10.80.13.38:6651
2021-03-10 14:38:26.212 WARN 7 --- [al-listener-3-1] o.a.pulsar.client.impl.PulsarClientImpl : [topic: persistent://myTenant/myNamespace/myTopic] Could not get connection while getPartitionedTopicMetadata -- Will try again in 100 ms
My Pulsar client is pretty basic:
PulsarClient.builder()
        .serviceUrl(serviceUrl)
        .authentication(AuthenticationFactory.token(authToken))
        .tlsTrustCertsFilePath(serverCertificateFilePath.toString())
        .enableTlsHostnameVerification(false)
        .allowTlsInsecureConnection(false)
        .build();
The producer is also pretty basic and looks like this:
pulsarClient.newProducer(Schema.STRING)
        .topic(topic)
        .create();
I've verified that the token and TLS cert are correct. I've also tried connecting a consumer from this same environment and got a similar exception, and I know that others with the same code are able to connect to the same Pulsar cluster from other environments. What is the issue?
Your connection is being blocked by a firewall or a network issue.
Verify that you can establish a connection to your endpoint my-pulsar-server-ms-tls.domain.com:6651 from your environment.
If you're able to run a network packet dump (like tcpdump), that should make it obvious if you're not able to establish a connection.
You can also try running curl my-pulsar-server-ms-tls.domain.com:6651: if you get back any response at all, that means you were able to reach the server. However, if you get Could not resolve host, or the request hangs until it times out, then you are being blocked by the network configuration (such as a missing DNS entry or route) or a firewall.
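If curl is not available in that environment, a plain TCP connect test gives the same signal; a minimal sketch, with the host and port taken from the log above:

// Quick TCP reachability check: tries to open a socket to the broker's TLS port with a
// short timeout, which is the first thing the Pulsar client's connection pool must do.
import java.net.InetSocketAddress;
import java.net.Socket;

public class ReachabilityCheck {
    public static void main(String[] args) {
        String host = "my-pulsar-server-ms-tls.domain.com";
        int port = 6651;
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5_000); // 5-second timeout
            System.out.println("TCP connection established to " + host + ":" + port);
        } catch (Exception e) {
            // A timeout here matches the Netty ConnectTimeoutException in the client log.
            System.out.println("Could not connect: " + e);
        }
    }
}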

How to find the first host contacted by cassandra driver during connection?

Is there a way to find out which node was contacted first by the driver during the initial setup? For example, in the log below, is there a way to find out that the host 10.9.58.64 was the one contacted?
WARNING:cassandra.cluster:Cluster.__init__ called with contact_points specified, but no load_balancing_policy. In the next major version, this will raise an error; please specify a load-balancing policy. (contact_points = ['cassandranode1,;cassandranode2'], lbp = None)
DEBUG:cassandra.cluster:Connecting to cluster, contact points: ['cassandranode1,;cassandranode2']; protocol version: 4
DEBUG:cassandra.io.asyncorereactor:Validated loop dispatch with cassandra.io.asyncorereactor._AsyncorePipeDispatcher
DEBUG:cassandra.pool:Host 10.9.58.64 is now marked up
DEBUG:cassandra.pool:Host 10.9.58.65 is now marked up
DEBUG:cassandra.cluster:[control connection] Opening new connection to 10.9.58.64
Right after the connection is established, you can use the cluster.get_control_connection_host function to get information about the host to which the so-called control connection is established. The control connection is used for administrative purposes, such as getting updates on the status of the nodes in the cluster. There is more information about the control connection in the documentation of the Java driver.

ZooKeeper connection timing out, Kafka-Spark streaming

I'm trying an exercise with Spark Streaming and Kafka. If I use the Kafka producer and consumer from the command line, I can publish and consume messages in Kafka. When I try to do it using pyspark in a Jupyter notebook, I get a ZooKeeper connection timeout error.
Client session timed out, have not heard from server in 6004ms for sessionid 0x0, closing socket connection and attempting reconnect
[2017-08-04 15:49:37,494] INFO Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient#158da8e (org.apache.zookeeper.ZooKeeper)
[2017-08-04 15:49:37,524] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2017-08-04 15:49:37,527] INFO Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2017-08-04 15:49:37,533] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[2017-08-04 15:49:38,637] INFO Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2017-08-04 15:49:38,639] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
ZooKeeper has issues when using localhost (127.0.0.1), as described in https://issues.apache.org/jira/browse/ZOOKEEPER-1661?focusedCommentId=13599352. The little test program in that comment demonstrates the following:
- ZooKeeper calls InetAddress.getAllByName (see StaticHostProvider:60) on the connect string "localhost:2181"; as a result it gets 3 different addresses for localhost, which then get shuffled (Collections.shuffle(this.serverAddresses), StaticHostProvider line 72).
- Because of the (random) shuffling, the call to StaticHostProvider.next will sometimes return the fe80:0:0:0:0:0:0:1%1 address, which times out after 5s; this explains the randomness the reporter was seeing.
- It really looks like a reverse DNS lookup issue with IPv6; whether that reverse lookup is actually useful and required by ZooKeeper is unclear, and it did not behave this way in 3.3.3.
Solution: specify your ZooKeeper address as an FQDN and make sure the reverse lookup works, or use 0.0.0.0 instead of localhost.
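As an illustration of that workaround, a minimal sketch that connects with an explicit FQDN connect string instead of relying on how localhost resolves (zk1.example.com is a placeholder hostname):

// Minimal ZooKeeper connection check using an explicit connect string, so the client
// does not depend on whether "localhost" resolves to an IPv4 or IPv6 address.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkConnectCheck {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 6000, (WatchedEvent event) -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        if (connected.await(10, TimeUnit.SECONDS)) {
            System.out.println("Connected, session id: 0x" + Long.toHexString(zk.getSessionId()));
        } else {
            System.out.println("Timed out waiting for SyncConnected");
        }
        zk.close();
    }
}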

Cassandra and defuncting connection

I've got a question about Cassandra; I haven't found any understandable answer yet...
I built a cluster of 3 nodes (RackInferringSnitch) on different VMs. I'm using DataStax's Java driver to read and update my keyspace (with CSVs).
When one node is down (i.e. 10.10.6.172), I get these debug messages:
INFO 00:47:37,195 New Cassandra host /10.10.6.172:9042 added
INFO 00:47:37,246 New Cassandra host /10.10.6.122:9042 added
DEBUG 00:47:37,264 [Control connection] Refreshing schema
DEBUG 00:47:37,384 [Control connection] Successfully connected to /10.10.6.171:9042
DEBUG 00:47:37,391 Adding /10.10.6.172:9042 to list of queried hosts
DEBUG 00:47:37,395 Defuncting connection to /10.10.6.172:9042
com.datastax.driver.core.TransportException: [/10.10.6.172:9042] Channel has been closed
at com.datastax.driver.core.Connection$Dispatcher.channelClosed(Connection.java:621)
at
[...]
[...]
DEBUG 00:47:37,400 [/10.10.6.172:9042-1] Error connecting to /10.10.6.172:9042 (Connection refused: /10.10.6.172:9042)
DEBUG 00:47:37,407 Error creating pool to /10.10.6.172:9042 ([/10.10.6.172:9042] Cannot connect)
DEBUG 00:47:37,408 /10.10.6.172:9042 is down, scheduling connection retries
DEBUG 00:47:37,409 First reconnection scheduled in 1000ms
DEBUG 00:47:37,410 Adding /10.10.6.122:9042 to list of queried hosts
DEBUG 00:47:37,423 Adding /10.10.6.171:9042 to list of queried hosts
DEBUG 00:47:37,427 Adding /10.10.6.122:9042 to list of queried hosts
DEBUG 00:47:37,435 Shutting down pool
DEBUG 00:47:37,439 Adding /10.10.6.171:9042 to list of queried hosts
DEBUG 00:47:37,443 Shutting down pool
DEBUG 00:47:37,459 Connected to cluster: WormHole
I wanted to know if I need to handle this exception, or whether it is handled by the driver itself (I mean, when the node comes back up, will Cassandra perform the correct write if the batch was a write?).
EDIT: The current consistency level is ONE.
The DataStax driver keeps track of which nodes are available at all times and routes queries (load balancing) based on this information. How quickly it retries a node that went down is governed by your reconnection policy.
You will see debug-level messages when nodes are detected as down, etc. This is no cause for concern: the driver will re-route to other available nodes, and it will also retry the downed nodes periodically to find out if they are back up. If you had a problem and the data was not getting saved to Cassandra, you would see timeout errors. No action is necessary in this case.
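For illustration, a sketch of making that reconnection behaviour explicit when building the Cluster with the legacy DataStax Java driver used in the question; ConstantReconnectionPolicy and the one-second delay are illustrative choices, not a recommendation:

// Build a Cluster with an explicit reconnection policy; the driver marks a node down,
// routes queries to the remaining nodes, and retries the down node on this schedule.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.ConstantReconnectionPolicy;

public class ClusterSetup {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoints("10.10.6.171", "10.10.6.172", "10.10.6.122")
                .withReconnectionPolicy(new ConstantReconnectionPolicy(1000)) // retry a down node every 1s
                .build();
        Session session = cluster.connect();
        // ... reads and writes; queries are routed around nodes currently marked down ...
        session.close();
        cluster.close();
    }
}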
