KairosDB failed to discover other Cassandra nodes in ring with Hector client

I have a multi-node Cassandra cluster (2.2.6) and a separate KairosDB server (1.1.1-1). I configured KairosDB with two Cassandra seed nodes and set it to auto-discover the other Cassandra nodes in the ring.
After setting the KairosDB log level to DEBUG, I see that only those two seed nodes are in the host pool (and working well). The Hector discovery process fails with an NPE, so in the end only these two seed nodes are used by KairosDB.
There might be a few solutions:
Add all nodes to the KairosDB properties, but that is harder to maintain.
Custom-build a new KairosDB binary with a later version of Hector (2.0.0), but I prefer to stay with official releases if possible.
Do you know a way to get around this? Thanks.
08-04|18:54:57.755 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:50] - Node discovery running...
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:74] - using existing hosts [cassandra-seed1(172.16.109.43):9160, cassandra-seed2(172.16.108.51):9160]
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] ERROR [NodeDiscovery.java:105] - Discovery Service failed attempt to connect CassandraHost
java.lang.NullPointerException: null
at me.prettyprint.cassandra.connection.NodeDiscovery.discoverNodes(NodeDiscovery.java:79) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeDiscovery.doAddNodes(NodeDiscovery.java:52) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeAutoDiscoverService.doAddNodes(NodeAutoDiscoverService.java:45) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeAutoDiscoverService$QueryRing.run(NodeAutoDiscoverService.java:51) [hector-core-1.1-4.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_101]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_101]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_101]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_101]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_101]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_101]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_101]
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:62] - Node discovery run complete.

Related

JanusGraph Error : "Could not find type for id" during a concurrent load operation

While performing a concurrent bulk load operation, I received this error. Subsequently, all my queries failed, and I kept getting the same error.
The exception I got is as follows:
java.lang.NullPointerException: Could not find type for id: 52237
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:250)
at org.janusgraph.graphdb.types.vertices.JanusGraphSchemaVertex.name(JanusGraphSchemaVertex.java:57)
at org.janusgraph.graphdb.vertices.AbstractVertex.label(AbstractVertex.java:121)
at org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceElement.<init>(ReferenceElement.java:57)
at org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceVertex.<init>(ReferenceVertex.java:46)
at org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceFactory.detach(ReferenceFactory.java:48)
at org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceFactory.detach(ReferenceFactory.java:69)
at org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceFactory.detach(ReferenceFactory.java:80)
at org.apache.tinkerpop.gremlin.process.traversal.strategy.decoration.HaltedTraverserStrategy.halt(HaltedTraverserStrategy.java:60)
at org.apache.tinkerpop.gremlin.server.util.TraverserIterator.next(TraverserIterator.java:64)
at org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor.handleIterator(TraversalOpProcessor.java:529)
at org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor.lambda$iterateBytecodeTraversal$4(TraversalOpProcessor.java:382)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Some additional context:
storage.batch-loading was NOT enabled
The bulk write operation I was running was highly concurrent and with high load
I used about 100 instances of gremlin server connecting to Cassandra/ES backend
I did not explicitly define a schema
It would be great if someone could give me an idea of what could have caused this.
Thanks!
This happens when multiple instances of Gremlin Server are running.
It can occur because Gremlin Server was not shut down or killed properly.
It can also occur because the VM on which Gremlin Server runs was restarted.
The solution is to log in to the Gremlin Console and run the commands that match your backend. In my case the backend is Cassandra and Elasticsearch,
so I run:
Method 1
:remote connect tinkerpop.server conf/remote.yaml session
:remote console session
or
graph=JanusGraphFactory.open('conf/janusgraph-cql-es.properties');
g=graph.traversal()
If you are running in containers, the command should look similar to this:
graph=JanusGraphFactory.open('/etc/opt/janusgraph/janusgraph.properties');
g=graph.traversal()
After running those commands, run:
mgmt = graph.openManagement()
mgmt.getOpenInstances()
This displays all the open instances, for example:
ac12000231-a9ffbcbb0e921
ac12000230-a9ffbcbb0e921(current)
Close every instance except the current one:
mgmt.forceCloseInstance('ac12000231-a9ffbcbb0e921')
After closing all the other instances, commit the changes:
mgmt.commit()
Now restart Gremlin Server and run your query; it should work.
Method 2
If the problem persists, kill Gremlin Server and start it again a few times; it should then work,
and the load command should work as well.
Another reason this happens is that the data was not restored properly.
If you are using a cluster, take the backup on all the nodes,
then restore it on your destination node or nodes.
I used nodetool for the backup and sstableloader to restore the data.

Apache Spark on k8s: securing RPC communication between driver and executors is not working

I have been trying a Spark 2.4 deployment on k8s and want to establish a secured RPC communication channel between the driver and the executors. I was using the following configuration parameters as part of spark-submit:
spark.authenticate true
spark.authenticate.secret good
spark.network.crypto.enabled true
spark.network.crypto.keyFactoryAlgorithm PBKDF2WithHmacSHA1
spark.network.crypto.saslFallback false
The driver and executors were not able to communicate on a secured channel and were throwing the following errors.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Unknown challenge message.
at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:109)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:181)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
Can someone guide me on this?
Disclaimer: I do not have a very deep understanding of the Spark implementation, so be careful when using the workaround described below.
AFAIK, Spark does not support auth/encryption on k8s in version 2.4.0.
There is a ticket, which is already fixed and will likely be released in the next Spark version: https://issues.apache.org/jira/browse/SPARK-26239
The problem is that Spark executors open a connection to the driver, and the configuration is sent to them only over this connection. However, an executor creates that connection with the default config AND any system properties starting with "spark.".
For reference, here is the place where executor opens the connection: https://github.com/apache/spark/blob/5fa4384/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L201
Theoretically, setting spark.executor.extraJavaOptions=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true ... should help, but the driver checks that no Spark parameters are set in extraJavaOptions.
There is a workaround, though (a little bit hacky): you can set spark.executorEnv.JAVA_TOOL_OPTIONS=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true .... Spark does not check this parameter, but the JVM reads this environment variable and adds the options to its system properties.
Also, instead of using JAVA_TOOL_OPTIONS to pass the secret, I would recommend using spark.executorEnv._SPARK_AUTH_SECRET=<secret>.
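For illustration, the workaround could be assembled into spark-submit arguments roughly as follows. This is only a sketch in plain Java: the class SecureRpcConf and its methods are hypothetical helpers, not part of Spark, and only the two properties named in the answer are included (the "..." in the answer stands for further properties that are not filled in here).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that assembles the --conf arguments for the workaround. */
final class SecureRpcConf {

    private SecureRpcConf() {
    }

    /** Joins the given properties into -Dkey=value options for JAVA_TOOL_OPTIONS. */
    static String toolOptions(Map<String, String> sparkProps) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : sparkProps.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    /** Returns spark-submit arguments; the secret is passed via _SPARK_AUTH_SECRET, not JAVA_TOOL_OPTIONS. */
    static List<String> confArgs(Map<String, String> sparkProps, String secret) {
        List<String> args = new ArrayList<>();
        args.add("--conf");
        args.add("spark.executorEnv.JAVA_TOOL_OPTIONS=" + toolOptions(sparkProps));
        args.add("--conf");
        args.add("spark.executorEnv._SPARK_AUTH_SECRET=" + secret);
        return args;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("spark.authenticate", "true");
        props.put("spark.network.crypto.enabled", "true");
        // "<secret>" is a placeholder, as in the answer.
        for (String arg : confArgs(props, "<secret>")) {
            System.out.println(arg);
        }
    }
}
```

Each value is kept as a single argument, so the space inside the JAVA_TOOL_OPTIONS value does not need shell quoting when the list is handed to a process launcher.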

Datastax driver connection exception (DSE 5.0, Cassandra 3.0.7, Spark)

I am trying to understand a warning. Every time I run my Spark job, I see the exception below on 2 nodes of my 3-node cluster. As I said, it is only a warning, and the job still succeeds.
com.datastax.driver.core.exceptions.ConnectionException: [x.x.x.x/x.x.x.x:9042] Pool was closed during initialization
CASSANDRA LOG
INFO [SharedPool-Worker-1] 2017-07-17 22:25:48,716 Message.java:605
- Unexpected exception during request; channel = [id: 0xf0ee1096, /x.x.x.x:54863 => /x.x.x.x:9042]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed:
Connection timed out
at io.netty.channel.unix.Errors.newIOException(Errors.java:105)
~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.unix.Errors.ioResult(Errors.java:121) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:134)
~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:239)
~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:822)
~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:348)
~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264)
~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
The core of the error is "Connection timed out". I recommend troubleshooting network connectivity to the Cassandra cluster, starting with simpler tools such as ping, telnet and nc. Some potential causes:
The Cassandra client's connection configuration included an address that is not valid (not a node in the Cassandra cluster).
A network misconfiguration or firewall rule is preventing connections from the client to the Cassandra server.
The destination Cassandra server is overloaded, such that it cannot respond to new connection requests.
You mentioned that the problem is intermittent ("seeing this in 2 nodes of my 3 node cluster") and does not cause job failure. This could be an indicator that any of the problems listed above is happening for just a subset of nodes in the cluster. (If connectivity to all nodes was broken, then the job likely would have failed.)
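The connectivity check suggested above (ping, telnet, nc) can also be scripted and run from the client host against every node. The following is a minimal sketch in plain Java, independent of the DataStax driver; the class name PortCheck and the method isReachable are hypothetical, and 9042 is the native-protocol port that appears in the log above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP port accepts connections, similar to `nc -z host port`. */
final class PortCheck {

    private PortCheck() {
    }

    /**
     * Returns true if a TCP connection to host:port succeeds within timeoutMillis.
     * A refused connection, a timeout, or an unresolvable host all yield false.
     */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java PortCheck <host> [<host> ...]");
            return;
        }
        for (String host : args) {
            // 9042 is the CQL native-protocol port seen in the log above.
            boolean ok = isReachable(host, 9042, 3000);
            System.out.println(host + ":9042 " + (ok ? "reachable" : "NOT reachable"));
        }
    }
}
```

Running it with the addresses of all three nodes from a Spark worker would show whether only a subset of nodes is unreachable from that host. A successful TCP connect only proves the port is open; it does not prove that Cassandra is healthy or not overloaded.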

Prevent logging of stacktrace for bad Cassandra contact point

When my Cassandra client program, which uses the Datastax Java driver, is given an invalid contact point (the hostname of a computer that is not actually running a Cassandra daemon), the driver itself logs a stacktrace. The stacktrace is worthless, however, because this is a configuration error rather than a bug, and it is preceded by a much more informative warning message.
How can I configure the Cassandra driver not to throw an exception in this case, or configure logback not to log the stacktrace?
Here are the noisy log messages I get at present:
2015-05-07 13:55:22,758 my-program: WARN You listed test-host-2.example.com/172.16.12.202:9042 in your contact points, but it could not be reached at startup
2015-05-07 13:55:22,919 my-program: WARN Some contact points don't match specified local data center. Local DC = DC1. Non-conforming contact points: /172.16.12.204:9042 (DC2)
2015-05-07 13:55:28,105 my-program: ERROR Error creating pool to test-host-2.example.com/172.16.12.202:9042
com.datastax.driver.core.TransportException: [test-host-2.example.com/172.16.12.202:9042] Cannot connect
at com.datastax.driver.core.Connection.<init>(Connection.java:106) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.PooledConnection.<init>(PooledConnection.java:32) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.Connection$Factory.open(Connection.java:521) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SingleConnectionPool.<init>(SingleConnectionPool.java:76) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.HostConnectionPool.newInstance(HostConnectionPool.java:35) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager.replacePool(SessionManager.java:239) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager.access$400(SessionManager.java:39) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:272) [my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:264) [my-program-1.0.0.1.jar:1.0.0.1]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
Caused by: org.jboss.netty.channel.ConnectTimeoutException: connection timed out: test-host-2.example.com/172.16.12.202:9042
at org.jboss.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137) ~[my-program-1.0.0.1.jar:1.0.0.1]
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83) ~[my-program-1.0.0.1.jar:1.0.0.1]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) ~[my-program-1.0.0.1.jar:1.0.0.1]
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42) ~[my-program-1.0.0.1.jar:1.0.0.1]
... 3 common frames omitted
This sounds like a feature request. Feel free to create a jira - https://datastax-oss.atlassian.net/secure/Dashboard.jspa
You could turn down logging, but I don't think you want to exclude ERRORs or connection timeouts.
Are you just bothered by the ERRORs in your logs? It may be useful to know when you have downed nodes that are contact points...
In the general case it makes sense to show the stack trace, it could be a different error (e.g. the server does have Cassandra running but authentication is enabled and you're not providing the right credentials).
If you really want to suppress stack traces in Logback, this is apparently possible with a custom layout.

All TCP connections between the DataStax driver and Cassandra remain in the active-close state, i.e. TIME_WAIT

The setup:
Web server
Apache Tomcat
RESTful web services
Using DataStax Java driver 2.0
Database
-2-node Cassandra 2.0.7.31 cluster
-replicas=1
Problem
After sending a set of 1500 requests more than three times, I got this error in the Tomcat log:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.181.13.239 ([/10.181.13.239] Unexpected exception triggered))
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:64)
at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:214)
at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:169)
at com.jpmc.es.rtm.storage.impl.EventExtract.main(EventExtract.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.181.13.239 ([/10.181.13.239] Unexpected exception triggered))
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:98)
at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:165)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
Observation
Once Tomcat is in this state, all further requests meet the same fate: the driver is not able to send my insert requests to Cassandra.
After running netstat, I find that all the TCP connections between the web server and Cassandra are in the TIME_WAIT state.
What could be the reason? Why is the Datastax driver not able to return the connections to the pool, or why is Cassandra holding on to all the connections from its client?
Thanks in advance.
The number of connections kept increasing because a new session was being created for each request. Now it is working fine.
// Configure the contact point for the cluster.
builder = new Cluster.Builder()
        .addContactPoints("192.168.114.42");
// Set the core connections per local host to the default maximum.
builder.withPoolingOptions(new PoolingOptions().setCoreConnectionsPerHost(
        HostDistance.LOCAL, new PoolingOptions().getMaxConnectionsPerHost(HostDistance.LOCAL)));
cluster = builder
        .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
        .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
        .build();
// Connect once and reuse this session instead of creating one per request.
session = cluster.connect("demodb");
Now the driver maintains 17 to 26 connections regardless of the number of transactions.
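The fix comes down to creating the expensive object once and sharing it across requests. A minimal sketch of that pattern in plain Java (8 or later) follows. SharedResource is a hypothetical helper, not a driver class; in the Tomcat application the supplier would presumably wrap the cluster.connect("demodb") call shown above, which is replaced here by a string stand-in so the example runs without the driver.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical holder that creates a resource on first use and then always returns the same instance. */
final class SharedResource<T> {

    private final Supplier<T> factory;
    private volatile T instance;

    SharedResource(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Thread-safe lazy initialization (double-checked locking on a volatile field). */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        // Stand-in for an expensive call such as cluster.connect("demodb").
        SharedResource<String> session =
                new SharedResource<>(() -> "session-" + created.incrementAndGet());
        for (int request = 0; request < 1500; request++) {
            session.get(); // every request reuses the same instance
        }
        System.out.println(session.get() + " created " + created.get() + " time(s)");
    }
}
```

Holding one such instance in a static field or an application-scoped bean would keep the factory from running once per request, which is what caused the connection count to grow in the original code.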
