DataStax driver connection exception with DSE 5.0, Cassandra 3.0.7, and the Spark-Cassandra connector

I am trying to understand this warning. Every time I run my Spark job I see the exception below on 2 of the 3 nodes in my cluster. As I said, it is just a warning, and the job succeeds regardless.
com.datastax.driver.core.exceptions.ConnectionException: [x.x.x.x/x.x.x.x:9042] Pool was closed during initialization
Cassandra log:
INFO [SharedPool-Worker-1] 2017-07-17 22:25:48,716 Message.java:605 - Unexpected exception during request; channel = [id: 0xf0ee1096, /x.x.x.x:54863 => /x.x.x.x:9042]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
at io.netty.channel.unix.Errors.newIOException(Errors.java:105) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.unix.Errors.ioResult(Errors.java:121) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:134) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:239) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:822) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:348) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]

The core of the error is "Connection timed out". I recommend troubleshooting network connectivity to the Cassandra cluster, starting with simple tools such as ping, telnet, and nc; a minimal programmatic check is sketched after this answer. Some potential causes:
The Cassandra client's connection configuration included an address that is not valid (not a node in the Cassandra cluster).
A network misconfiguration or firewall rule is preventing connections from the client to the Cassandra server.
The destination Cassandra server is overloaded, such that it cannot respond to new connection requests.
You mentioned that the problem is intermittent ("seeing this in 2 nodes of my 3 node cluster") and does not cause job failure. This could indicate that one of the problems listed above affects only a subset of nodes in the cluster. (If connectivity to all nodes were broken, the job would likely have failed.)
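If the usual command-line tools are unavailable on the Spark worker hosts, the same check can be done from Java. This is a minimal sketch (the class name and placeholder address are not from the original post) that verifies the native transport port 9042 is reachable within a bounded timeout:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: check whether a Cassandra node's native transport port is reachable.
public class CqlPortCheck {
    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "x.x.x.x"; // replace with a Cassandra node address
        int port = 9042;                                      // native transport port
        try (Socket socket = new Socket()) {
            // Use an explicit timeout so the check fails fast instead of waiting on the OS default.
            socket.connect(new InetSocketAddress(host, port), 5000);
            System.out.println("Connected to " + host + ":" + port);
        } catch (IOException e) {
            System.out.println("Could not connect to " + host + ":" + port + ": " + e);
        }
    }
}

Running the check from each Spark worker against each Cassandra node helps narrow down whether the timeout is specific to certain node pairs.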

Related

KairosDB failed to discover other Cassandra nodes in ring with Hector client

I have a multi-node Cassandra cluster (2.2.6) and a separate KairosDB server (1.1.1-1). I configured KairosDB with two Cassandra seed nodes and enabled auto-discovery of the other Cassandra nodes in the ring.
After turning the KairosDB log level up to DEBUG, I can see that only those two seed nodes are in the host pool (and working well); the Hector discovery process fails with an NPE, so in the end only the two seed nodes are used by KairosDB.
There might be a few solutions:
Add all nodes to the KairosDB properties, but that is harder to maintain (a sketch of what this might look like follows the log below).
Custom-build a KairosDB binary against a later Hector version (2.0.0), but I would prefer to stay on official releases if possible.
Do you know a way to get around this? Thanks.
08-04|18:54:57.755 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:50] - Node discovery running...
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:74] - using existing hosts [cassandra-seed1(172.16.109.43):9160, cassandra-seed2(172.16.108.51):9160]
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] ERROR [NodeDiscovery.java:105] - Discovery Service failed attempt to connect CassandraHost
java.lang.NullPointerException: null
at me.prettyprint.cassandra.connection.NodeDiscovery.discoverNodes(NodeDiscovery.java:79) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeDiscovery.doAddNodes(NodeDiscovery.java:52) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeAutoDiscoverService.doAddNodes(NodeAutoDiscoverService.java:45) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeAutoDiscoverService$QueryRing.run(NodeAutoDiscoverService.java:51) [hector-core-1.1-4.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_101]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_101]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_101]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_101]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_101]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_101]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_101]
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:62] - Node discovery run complete.
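As a sketch of the first workaround above, every Cassandra node can be listed explicitly in kairosdb.properties. This assumes the Hector-based Cassandra datastore of KairosDB 1.1.x and its kairosdb.datastore.cassandra.host_list key; verify the exact property name against the kairosdb.properties shipped with your version. The hostnames are the two from the log plus an illustrative third node:

# kairosdb.properties - list every Cassandra node explicitly instead of relying on discovery
kairosdb.datastore.cassandra.host_list=cassandra-seed1:9160,cassandra-seed2:9160,cassandra-node3:9160

This avoids the broken discovery path entirely, at the cost of editing the file whenever the ring changes.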

We are running a MapReduce/Spark job to bulk load HBase data in one of the environments

While running it, the connection to HBase via ZooKeeper cannot be initialized, and the job fails with the following error.
16/05/10 06:36:10 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=c321shu.int.westgroup.com:2181,c149jub.int.westgroup.com:2181,c167rvm.int.westgroup.com:2181 sessionTimeout=90000 watcher=hconnection-0x74b47a30, quorum=c321shu.int.westgroup.com:2181,c149jub.int.westgroup.com:2181,c167rvm.int.westgroup.com:2181, baseZNode=/hbase
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Opening socket connection to server c321shu.int.westgroup.com/10.204.152.28:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.204.24.16:35740, server: c321shu.int.westgroup.com/10.204.152.28:2181
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Session establishment complete on server c321shu.int.westgroup.com/10.204.152.28:2181, sessionid = 0x5534bebb441bd3f, negotiated timeout = 60000
16/05/10 06:36:11 INFO mapreduce.HFileOutputFormat2: Looking up current regions for table ecpdevv1patents:NormNovusDemo
Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
Tue May 10 06:36:11 CDT 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller#3927df20, java.io.IOException: Call to c873gpv.int.westgroup.com/10.204.67.9:60020 failed on local exception: java.io.EOFException
We have executed the same job in the Titan DEV environment as well, but we face the same problem there. Please let us know if anyone has run into this before.
Details:
• Earlier the job was failing to connect to localhost/127.0.0.1:2181, so the property hbase.zookeeper.quorum was set in the MapReduce code to c149jub.int.westgroup.com,c321shu.int.westgroup.com,c167rvm.int.westgroup.com, taken from hbase-site.xml (see the sketch after this list).
• We are using jars from CDH version 5.3.3.
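For reference, setting the quorum programmatically on the job's Configuration might look like the sketch below. The hostnames are the ones from the question; hbase.zookeeper.property.clientPort is shown only as a commonly needed companion setting and is an assumption about this environment:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Minimal sketch: point the HBase client at the ZooKeeper quorum explicitly
// instead of letting it fall back to localhost.
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum",
        "c149jub.int.westgroup.com,c321shu.int.westgroup.com,c167rvm.int.westgroup.com");
conf.set("hbase.zookeeper.property.clientPort", "2181");

Note, though, that the quoted log shows the ZooKeeper session being established successfully; the RetriesExhaustedException comes from an RPC to the region server c873gpv.int.westgroup.com:60020, so connectivity to that port is also worth checking.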

Accumulo's createtable command gets stuck and does not create a table

I was trying to create a table inside Accumulo using the createtable command and found out that it was getting stuck. I waited for around 20 mins before cancelling the createtable command.
createtable test_table
I have one master and 2 tablet servers, and found that my master and one of the tablet servers had died. I could not telnet to port 9997 of that particular tablet server, and I could not even telnet to port 29999 (master.port.client in accumulo-site.xml). The tserver log of the dead server contained the following entries.
2016-05-10 02:12:07,456 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/recovery
2016-05-10 02:12:23,883 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.accumulo.fate.zookeeper.ZooCache$1.run(ZooCache.java:210)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.getChildren(ZooCache.java:221)
at org.apache.accumulo.core.client.impl.Tables.exists(Tables.java:142)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.tableExists(LargestFirstMemoryManager.java:149)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.getMemoryManagementActions(LargestFirstMemoryManager.java:175)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:408)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.access$400(TabletServerResourceManager.java:318)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework$2.run(TabletServerResourceManager.java:346)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,884 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables/!0/conf/table.classpath.context
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:117)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:103)
at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:99)
at org.apache.accumulo.tserver.constraints.ConstraintChecker.classLoaderChanged(ConstraintChecker.java:93)
at org.apache.accumulo.tserver.tablet.Tablet.checkConstraints(Tablet.java:1225)
at org.apache.accumulo.tserver.TabletServer$8.run(TabletServer.java:2848)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,887 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tservers/accumulo.tablet.2:9997
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:132)
at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:383)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Changing monitor log4j address to accumulo.master:4560
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Enabled log-forwarding
The master server's logs contained the same stacktrace. My ZooKeeper is running.
At first I thought it was a disk issue, perhaps a lack of space, but that was not the case: I ran fsck on the Accumulo instance.volumes and it returned a HEALTHY status.
Does anyone know what exactly happened and, if possible, how to avoid it?
EDIT: tracer_accumulo.master.log contained the same stacktrace as well.
ZooKeeper session expirations occur when a thread inside the ZooKeeper client does not get to run within the time needed (30 s by default) to maintain the session, which is in-memory state shared between the ZooKeeper client and server. There is no single explanation for this, but there are several common culprits:
JVM garbage collection pauses in the client. Accumulo should log a warning if it experienced a pause.
Lack of CPU time. If the host itself is overburdened, Accumulo might not have the cycles to run all of the tasks it needs to in a timely manner.
Lack of sockets/file handles: Accumulo could be trying to connect to ZooKeeper but be unable to open new connections.
ZooKeeper might be rate-limiting connections as a denial-of-service protection. Check the ZooKeeper logs for errors about dropping or denying new connections from a specific IP and, if you see them, consider increasing maxClientCnxns in zoo.cfg (a sketch follows).
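If that last case applies, raising the per-client connection limit is a one-line zoo.cfg change on each ZooKeeper server. The value below is only illustrative; the default in recent ZooKeeper releases is 60, and the change takes effect after a restart:

# zoo.cfg - allow more concurrent connections from a single client IP
maxClientCnxns=100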

Prevent logging of stacktrace for bad Cassandra contact point

When my Cassandra client program, which uses the DataStax Java driver, is given an invalid contact point (a hostname of a computer that is not actually running a Cassandra daemon), the driver itself logs a stacktrace. The stacktrace is worthless, however, as this is a configuration error rather than a bug, and it is preceded by a much more informative warning message.
How can I configure the Cassandra driver not to throw an exception in this case, or configure Logback not to log the stacktrace?
Here are the noisy log messages I currently get:
2015-05-07 13:55:22,758 my-program: WARN You listed test-host-2.example.com/172.16.12.202:9042 in your contact points, but it could not be reached at startup
2015-05-07 13:55:22,919 my-program: WARN Some contact points don't match specified local data center. Local DC = DC1. Non-conforming contact points: /172.16.12.204:9042 (DC2)
2015-05-07 13:55:28,105 my-program: ERROR Error creating pool to test-host-2.example.com/172.16.12.202:9042
com.datastax.driver.core.TransportException: [test-host-2.example.com/172.16.12.202:9042] Cannot connect
at com.datastax.driver.core.Connection.(Connection.java:106) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.PooledConnection.(PooledConnection.java:32) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.Connection$Factory.open(Connection.java:521) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SingleConnectionPool.(SingleConnectionPool.java:76) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.HostConnectionPool.newInstance(HostConnectionPool.java:35) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager.replacePool(SessionManager.java:239) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager.access$400(SessionManager.java:39) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:272) [my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:264) [my-program-1.0.0.1.jar:1.0.0.1]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
Caused by: org.jboss.netty.channel.ConnectTimeoutException: connection timed out: test-host-2.example.com/172.16.12.202:9042
at org.jboss.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137) ~[my-program-1.0.0.1.jar:1.0.0.1]
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83) ~[my-program-1.0.0.1.jar:1.0.0.1]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) ~[my-program-1.0.0.1.jar:1.0.0.1]
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42) ~[my-program-1.0.0.1.jar:1.0.0.1]
... 3 common frames omitted
This sounds like a feature request. Feel free to create a JIRA: https://datastax-oss.atlassian.net/secure/Dashboard.jspa
You could turn down logging, but I don't think you want to exclude ERRORs or connection timeouts.
Are you just bothered by the ERRORs in your logs? It may be useful to know when contact points are down...
In the general case it makes sense to show the stack trace, since it could be a different error (e.g. the server is running Cassandra but authentication is enabled and you are not providing the right credentials).
If you really want to suppress stack traces in Logback, this is apparently possible with a custom layout; a sketch follows.
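A minimal sketch of that approach, assuming a PatternLayout-based encoder: the %nopex conversion word tells Logback not to append the exception's stack trace to the message, so only the one-line ERROR remains. The appender name and pattern are illustrative, not taken from the original program:

<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <!-- %nopex suppresses the stack trace Logback would otherwise append -->
      <pattern>%d{yyyy-MM-dd HH:mm:ss,SSS} %logger: %level %msg%n%nopex</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>

Note that this hides stack traces for every event handled by that appender, not just the contact-point error, which is usually more than you want.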

All TCP connections between the DataStax driver and Cassandra remain in the active-close (TIME_WAIT) state

The setup:
Web server:
- Apache Tomcat
- RESTful web services
- DataStax Java driver 2.0
Database:
- 2-node Cassandra 2.0.7.31 cluster
- replicas = 1
Problem
After sending a set of 1500 requests more than three times, I get the following error in the Tomcat log:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.181.13.239 ([/10.181.13.239] Unexpected exception triggered))
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:64)
at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:214)
at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:169)
at com.jpmc.es.rtm.storage.impl.EventExtract.main(EventExtract.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.181.13.239 ([/10.181.13.239] Unexpected exception triggered))
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:98)
at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:165)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
Observation
After Tomcat reaches this state, all further requests meet the same fate; the driver can no longer send my insert requests to Cassandra.
After running netstat, I find that all TCP connections between the web server and Cassandra are in the TIME_WAIT state.
What could be the reason? Why is the DataStax driver unable to return connections to the pool, and why is Cassandra tying up all of the connections from its client?
Thanks in advance.
The number of connections kept increasing because a new Session was being created for each request. After building the Cluster and Session once and reusing them, it is working fine.
import com.datastax.driver.core.*;
import com.datastax.driver.core.policies.*;

// Build the Cluster once at application startup and reuse a single Session.
PoolingOptions poolingOptions = new PoolingOptions();
poolingOptions.setCoreConnectionsPerHost(HostDistance.LOCAL,
        poolingOptions.getMaxConnectionsPerHost(HostDistance.LOCAL));

Cluster cluster = Cluster.builder()
        .addContactPoints("192.168.114.42")
        .withPoolingOptions(poolingOptions)
        .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
        .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
        .build();
Session session = cluster.connect("demodb");
The driver now maintains 17-26 connections regardless of the number of transactions.
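One simple way to guarantee that only one Session exists for the whole web application is to hold it in an application-scoped singleton and hand it to every request handler. This is a minimal sketch, not the original poster's code; the class name is hypothetical and the contact point and keyspace are copied from the snippet above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Hypothetical holder: one Cluster and one Session shared by all REST requests.
public final class CassandraSessionHolder {
    private static final Cluster CLUSTER = Cluster.builder()
            .addContactPoints("192.168.114.42")
            .build();
    private static final Session SESSION = CLUSTER.connect("demodb");

    private CassandraSessionHolder() {}

    public static Session session() {
        return SESSION;
    }

    // Call once (e.g. from a ServletContextListener) when the application shuts down.
    public static void close() {
        CLUSTER.close();
    }
}

Because the driver multiplexes requests over its pooled connections, a single shared Session is thread-safe and keeps the connection count stable instead of growing with every request.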
