zookeeper connection timing out, kafka-spark streaming - apache-spark

I'm working through some exercises with Spark Streaming and Kafka. If I use the Kafka producer and consumer on the command line, I can publish and consume messages in Kafka. But when I try to do the same using pyspark in a Jupyter notebook, I get a ZooKeeper connection timeout error:
Client session timed out, have not heard from server in 6004ms for sessionid 0x0, closing socket connection and attempting reconnect
[2017-08-04 15:49:37,494] INFO Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient#158da8e (org.apache.zookeeper.ZooKeeper)
[2017-08-04 15:49:37,524] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2017-08-04 15:49:37,527] INFO Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2017-08-04 15:49:37,533] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[2017-08-04 15:49:38,637] INFO Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2017-08-04 15:49:38,639] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused

ZooKeeper has known issues when the connect string uses localhost (127.0.0.1), described in https://issues.apache.org/jira/browse/ZOOKEEPER-1661?focusedCommentId=13599352. The comment there uses a little program to show the following:
ZooKeeper calls InetAddress.getAllByName (see StaticHostProvider:60) on the connect string "localhost:2181", so it gets back several different addresses for localhost, which are then shuffled (Collections.shuffle(this.serverAddresses), line 72).
Because of that random shuffle, StaticHostProvider.next will sometimes return the fe80:0:0:0:0:0:0:1%1 address, which (as the small program shows) times out after 5 s; this explains the randomness being observed.
It looks like a reverse DNS lookup issue with IPv6; whether that reverse lookup is actually useful and required by ZooKeeper is unclear. It did not behave this way in 3.3.3.
Solution: specify your ZooKeeper address as an FQDN and make sure the reverse lookup works, or use 0.0.0.0 instead of localhost.
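As a minimal illustration of the behaviour that JIRA comment describes, a small Java program (the class name is just for this sketch) can list every address that "localhost" resolves to; if an IPv6 link-local address such as fe80:0:0:0:0:0:0:1 appears in the output, the shuffled server list can occasionally pick it and time out:

import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch: print every address "localhost" resolves to, mirroring the
// InetAddress.getAllByName lookup that ZooKeeper's StaticHostProvider
// performs on the connect string.
public class LocalhostResolution {
    public static void main(String[] args) throws UnknownHostException {
        for (InetAddress addr : InetAddress.getAllByName("localhost")) {
            System.out.println(addr.getHostAddress());
        }
    }
}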

Related

Could not get connection while getPartitionedTopicMetadata - io.netty.channel.ConnectTimeoutException: connection timed out

I have a basic Pulsar app, and when I try to connect to Pulsar, I get this exception:
2021-03-10 14:38:26.107 WARN 7 --- [r-client-io-1-1] o.a.pulsar.client.impl.ConnectionPool : Failed to open connection to my-pulsar-server-ms-tls.domain.com:6651 : io.netty.channel.ConnectTimeoutException: connection timed out: my-pulsar-server-ms-tls.domain.com/10.80.13.38:6651
2021-03-10 14:38:26.212 WARN 7 --- [al-listener-3-1] o.a.pulsar.client.impl.PulsarClientImpl : [topic: persistent://myTenant/myNamespace/myTopic] Could not get connection while getPartitionedTopicMetadata -- Will try again in 100 ms
My Pulsar client is pretty basic:
PulsarClient pulsarClient = PulsarClient.builder()
        .serviceUrl(serviceUrl)
        .authentication(AuthenticationFactory.token(authToken))
        .tlsTrustCertsFilePath(serverCertificateFilePath.toString())
        .enableTlsHostnameVerification(false)
        .allowTlsInsecureConnection(false)
        .build();
The producer is also pretty basic and looks like this:
Producer<String> producer = pulsarClient.newProducer(Schema.STRING)
        .topic(topic)
        .create();
I've verified that the token and TLS cert are correct. I've also tried connecting a consumer from this same environment and got a similar exception, and I know that others with the same code are able to connect to the same Pulsar cluster from other environments. What is the issue?
Your connection is getting blocked by a firewall or network issue.
Verify that you can establish a connection to your endpoint my-pulsar-server-ms-tls.domain.com:6651 from your environment.
If you're able to run a network packet dump (like tcpdump), that should make it obvious if you're not able to establish a connection.
You can also try running curl my-pulsar-server-ms-tls.domain.com:6651; if you get some HTML back, you were able to reach the server. If instead you get an error such as Could not resolve host, the request was blocked by the network configuration (for example a missing route) or by a firewall.
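If curl isn't available in that environment, a plain TCP probe gives the same information; here is a minimal sketch, with the hostname and port taken from the error message and the timeout chosen arbitrarily:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch: try to open a raw TCP connection to the Pulsar endpoint.
// Success means the host/port is reachable from this environment;
// a timeout or "Connection refused" points at a firewall, routing,
// or DNS problem rather than the Pulsar client configuration.
public class ReachabilityCheck {
    public static void main(String[] args) {
        String host = "my-pulsar-server-ms-tls.domain.com";
        int port = 6651;
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 10_000); // 10 s timeout
            System.out.println("TCP connection established");
        } catch (IOException e) {
            System.out.println("Could not connect: " + e);
        }
    }
}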

Elasticsearch spark connection for structured-streaming

I'm trying to make a connection to elasticsearch from my spark program.
My Elasticsearch host is only reachable over HTTPS, and I found no connection property for that.
We are using the Spark Structured Streaming Java API, and the connection details are as follows:
SparkSession spark = SparkSession.builder()
        .config(ConfigurationOptions.ES_NET_HTTP_AUTH_USER, "username")
        .config(ConfigurationOptions.ES_NET_HTTP_AUTH_PASS, "password")
        .config(ConfigurationOptions.ES_NODES, "my_host_url")
        .config(ConfigurationOptions.ES_PORT, "9200")
        .config(ConfigurationOptions.ES_NET_SSL_TRUST_STORE_LOCATION, "C:\\certs\\elastic\\truststore.jks")
        .config(ConfigurationOptions.ES_NET_SSL_TRUST_STORE_PASS, "my_password")
        .config(ConfigurationOptions.ES_NET_SSL_KEYSTORE_TYPE, "jks")
        .master("local[2]")
        .appName("spark_elastic")
        .getOrCreate();
spark.conf().set("spark.sql.shuffle.partitions", 2);
spark.conf().set("spark.default.parallelism", 2);
And I'm getting the following error:
19/07/01 12:26:00 INFO HttpMethodDirector: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server 10.xx.xxx.xxx failed to respond
19/07/01 12:26:00 INFO HttpMethodDirector: Retrying request
19/07/01 12:26:00 ERROR NetworkClient: Node [10.xx.xxx.xxx:9200] failed (The server 10.xx.xxx.xxx failed to respond); no other nodes left - aborting...
19/07/01 12:26:00 ERROR StpMain: Error
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:344)
Probably this is because it tries to initiate the connection over HTTP, while in my case I need an HTTPS connection, and I'm not sure how to configure that.
The error happened because Spark was not able to locate the truststore file. It seems we need to add "file:\\" in front of the path for it to be accepted.
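For reference, a sketch of what the HTTPS variant of the builder might look like, assuming your elasticsearch-hadoop version exposes ConfigurationOptions.ES_NET_USE_SSL (es.net.ssl) and accepts a file: URL for the truststore location; the host, path, and passwords are placeholders, and the exact URL form may differ:

import org.apache.spark.sql.SparkSession;
import org.elasticsearch.hadoop.cfg.ConfigurationOptions;

SparkSession spark = SparkSession.builder()
        .config(ConfigurationOptions.ES_NODES, "my_host_url")
        .config(ConfigurationOptions.ES_PORT, "9200")
        .config(ConfigurationOptions.ES_NET_HTTP_AUTH_USER, "username")
        .config(ConfigurationOptions.ES_NET_HTTP_AUTH_PASS, "password")
        // enable SSL so the connector talks HTTPS instead of HTTP
        .config(ConfigurationOptions.ES_NET_USE_SSL, "true")
        // truststore referenced as a file: URL so Spark can locate it locally
        .config(ConfigurationOptions.ES_NET_SSL_TRUST_STORE_LOCATION, "file:///C:/certs/elastic/truststore.jks")
        .config(ConfigurationOptions.ES_NET_SSL_TRUST_STORE_PASS, "my_password")
        .master("local[2]")
        .appName("spark_elastic")
        .getOrCreate();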

We are running a map reduce/spark job to bulk load HBase data in one of the environments

We are running a map reduce/spark job to bulk load hbase data in one of the environments.
While running it, the connection to the HBase ZooKeeper cannot be initialized, throwing the following error.
16/05/10 06:36:10 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=c321shu.int.westgroup.com:2181,c149jub.int.westgroup.com:2181,c167rvm.int.westgroup.com:2181 sessionTimeout=90000 watcher=hconnection-0x74b47a30, quorum=c321shu.int.westgroup.com:2181,c149jub.int.westgroup.com:2181,c167rvm.int.westgroup.com:2181, baseZNode=/hbase
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Opening socket connection to server c321shu.int.westgroup.com/10.204.152.28:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.204.24.16:35740, server: c321shu.int.westgroup.com/10.204.152.28:2181
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Session establishment complete on server c321shu.int.westgroup.com/10.204.152.28:2181, sessionid = 0x5534bebb441bd3f, negotiated timeout = 60000
16/05/10 06:36:11 INFO mapreduce.HFileOutputFormat2: Looking up current regions for table ecpdevv1patents:NormNovusDemo
Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
Tue May 10 06:36:11 CDT 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller#3927df20, java.io.IOException: Call to c873gpv.int.westgroup.com/10.204.67.9:60020 failed on local exception: java.io.EOFException
We have executed the same job in Titan DEV too but are facing the same problem. Please let us know if anyone has faced this problem before.
Details are:
• Earlier the job was failing to connect to localhost/127.0.0.1:2181, so only the property hbase.zookeeper.quorum has been set in the map reduce code, to c149jub.int.westgroup.com,c321shu.int.westgroup.com,c167rvm.int.westgroup.com, which we got from hbase-site.xml (see the sketch after this list).
• We are using jars of CDH version 5.3.3.
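For illustration, a minimal sketch of setting that property on the job configuration; the quorum string is the one from the question, and the client port line is an assumption, shown only because 2181 appears in the connect string:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch: point the HBase client used by the MapReduce/Spark job at the
// ZooKeeper quorum from hbase-site.xml instead of the default localhost:2181.
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum",
        "c149jub.int.westgroup.com,c321shu.int.westgroup.com,c167rvm.int.westgroup.com");
conf.set("hbase.zookeeper.property.clientPort", "2181");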

How to leave a Hazelcast cluster gracefully?

Currently, when I remove a node (e.g. ip-2), I simply call HazelcastInstance.shutdown(). But I still end up seeing a lot of warnings in the logs, e.g.
[ip-1]:5701 [xxx] [3.3.3] Removing connection to endpoint Address[ip-2]:5701 Cause => java.net.SocketException {Connection refused to address /ip-2:5701}, Error-Count: 5
[ip-1]:5701 [xxx] [3.3.3] This node does not have a connection to Member [ip-2]:5701
[ip-1]:5701 [xxx] [3.3.3] hz._hzInstance_1_xxx.IO.thread-in-0 Closing socket to endpoint Address[ip-2]:5701, Cause:java.io.EOFException: Remote socket closed!
Is there a more proper way to remove nodes from a cluster?
Calling HazelcastInstance.shutdown() is the recommended way; I guess the logging on the remaining members is just a bit confusing.
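For completeness, a minimal sketch of that graceful shutdown (the instance creation and class name are illustrative):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Sketch: HazelcastInstance.shutdown() leaves the cluster gracefully and
// notifies the other members before the process exits. The remaining members
// still log the closed socket, which is the noise seen in the question.
public class GracefulLeave {
    public static void main(String[] args) {
        HazelcastInstance instance = Hazelcast.newHazelcastInstance();
        // ... use the instance ...
        instance.shutdown();
    }
}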

WSO2 BAM with offset 1. Cassandra error

I have a problem on startup of the BAM server.
My machine has the IP 1.33.33.127 and hostname "srv-lc-presen".
I have configured it using this document:
Monitoring and statistics.
I have modified the offset in carbon.xml and set it to 1.
I've modified the master-datasources.xml and set:
WSO2BAM_CASSANDRA_DATASOURCE url = jdbc:cassandra://srv-lc-presen:9161/EVENT_KS
WSO2BAM_UTIL_DATASOURCE url = jdbc:cassandra://srv-lc-presen:9161/BAM_UTIL_KS
I have tried with localhost, 1.33.33.127 and srv-lc-presen.
I always get the same error:
ERROR {me.prettyprint.cassandra.connection.HConnectionManager} - Could not start connection pool for host srv-lc-presen(1.33.33.127):9161
[2014-05-07 12:04:24,983] WARN {me.prettyprint.cassandra.connection.CassandraHostRetryService} - Downed srv-lc-presen(1.33.33.127):9161 host still appears to be down: Unable to open transport to srv-lc-presen(1.33.33.127):9161 , java.net.ConnectException: Connection refused
[2014-05-07 12:04:24,987] ERROR {org.wso2.carbon.bam.notification.task.internal.NotificationDispatchComponent} - All host pools marked down. Retry burden pushed out to client.
me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
at me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:393)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:249)
at me.prettyprint.cassandra.service.ThriftCluster.addKeyspace(ThriftCluster.java:168)
at org.wso2.carbon.bam.datasource.utils.DataSourceUtils.createKeyspaceIfNotExist(DataSourceUtils.java:80)
at org.wso2.carbon.bam.datasource.utils.DataSourceUtils.getClusterKeyspaceFromRDBMSConfig(DataSourceUtils.java:92)
at org.wso2.carbon.bam.datasource.utils.DataSourceUtils.getClusterKeyspaceFromRDBMSDataSource(DataSourceUtils.java:96)
NEW information
I have tried to reconfigure, and I still can't find the problem.
In the BAM console I see this error:
[2014-05-08 09:10:57,531] ERROR {me.prettyprint.cassandra.connection.HConnectionManager} - Could not start connection pool for host 1.33.33.127(1.33.33.127):9161
[2014-05-08 09:10:57,564] ERROR {org.wso2.carbon.bam.notification.task.internal.NotificationDispatchComponent} - All host pools marked down. Retry burden pushed out to client.
me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
at me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:393)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:249)
at me.prettyprint.cassandra.service.ThriftCluster.addKeyspace(ThriftCluster.java:168)
at org.wso2.carbon.bam.datasource.utils.DataSourceUtils.createKeyspaceIfNotExist(DataSourceUtils.java:80)
at org.wso2.carbon.bam.datasource.utils.DataSourceUtils.getClusterKeyspaceFromRDBMSConfig(DataSourceUtils.java:92)
at org.wso2.carbon.bam.datasource.utils.DataSourceUtils.getClusterKeyspaceFromRDBMSDataSource(DataSourceUtils.java:96)
at org.wso2.carbon.bam.notification.task.internal.NotificationDispatchComponent.initRecordStore(NotificationDispatchComponent.java:72)
at org.wso2.carbon.bam.notification.task.internal.NotificationDispatchComponent.activate(NotificationDispatchComponent.java:64)
And in the API Manager console this:
[2014-05-08 09:14:52,096] ERROR - ReceiverGroup No receiver is reachable at reconnection, can't publish the events
[2014-05-08 09:14:55,102] ERROR - AsyncDataPublisher Reconnection failed for for tcp://1.33.33.127:7612/
If you are not using the notification feature, please use the following command at startup (or edit wso2server.sh accordingly): sh wso2server.sh -Ddisable.notification.task
https://docs.wso2.org/display/BAM240/Notifications
