Prevent logging of stacktrace for bad Cassandra contact point

When my Cassandra client program, which uses the DataStax Java driver, is given an invalid contact point (the hostname of a computer that is not actually running a Cassandra daemon), the driver itself logs a stacktrace. The stacktrace is worthless, however, because the cause is a configuration error rather than a bug, and it is preceded by a much more informative warning message.
How can I configure the Cassandra driver not to throw an exception in this case, or configure logback not to log the stacktrace?
Here are the noisy log messages I get at present:
2015-05-07 13:55:22,758 my-program: WARN You listed test-host-2.example.com/172.16.12.202:9042 in your contact points, but it could not be reached at startup
2015-05-07 13:55:22,919 my-program: WARN Some contact points don't match specified local data center. Local DC = DC1. Non-conforming contact points: /172.16.12.204:9042 (DC2)
2015-05-07 13:55:28,105 my-program: ERROR Error creating pool to test-host-2.example.com/172.16.12.202:9042
com.datastax.driver.core.TransportException: [test-host-2.example.com/172.16.12.202:9042] Cannot connect
at com.datastax.driver.core.Connection.<init>(Connection.java:106) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.PooledConnection.<init>(PooledConnection.java:32) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.Connection$Factory.open(Connection.java:521) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SingleConnectionPool.<init>(SingleConnectionPool.java:76) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.HostConnectionPool.newInstance(HostConnectionPool.java:35) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager.replacePool(SessionManager.java:239) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager.access$400(SessionManager.java:39) ~[my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:272) [my-program-1.0.0.1.jar:1.0.0.1]
at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:264) [my-program-1.0.0.1.jar:1.0.0.1]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
Caused by: org.jboss.netty.channel.ConnectTimeoutException: connection timed out: test-host-2.example.com/172.16.12.202:9042
at org.jboss.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137) ~[my-program-1.0.0.1.jar:1.0.0.1]
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83) ~[my-program-1.0.0.1.jar:1.0.0.1]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) ~[my-program-1.0.0.1.jar:1.0.0.1]
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42) ~[my-program-1.0.0.1.jar:1.0.0.1]
... 3 common frames omitted

This sounds like a feature request. Feel free to create a jira - https://datastax-oss.atlassian.net/secure/Dashboard.jspa
You could turn down logging, but I don't think you want to exclude ERRORs or connection timeouts.
Are you just bothered by the ERRORs in your logs? It may be useful to know when you have downed nodes that are contact points...

In the general case it makes sense to show the stack trace, because it could be a different error (e.g. the server does have Cassandra running, but authentication is enabled and you're not providing the right credentials).
If you really want to suppress stack traces in Logback, this is apparently possible with a custom layout.
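As a hedged sketch of one way to do this without writing a layout class: logback's pattern layout has a `%nopex` conversion word that drops the stack trace from the output. The fragment below routes the driver's loggers to a second appender whose pattern ends in `%nopex`. The appender names and the patterns are made up for illustration, and the logger name `com.datastax.driver.core` is an assumption taken from the package in the stack trace above, so check it against your own logs.

```xml
<configuration>
  <!-- Default appender: exceptions are printed with their stack traces. -->
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%date %level %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Same output, but %nopex suppresses the stack trace. -->
  <appender name="STDOUT_NO_TRACE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%date %level %msg%nopex%n</pattern>
    </encoder>
  </appender>

  <!-- Assumption: the driver logs under its package name.
       additivity="false" keeps these events away from the root appender,
       otherwise each event would be printed twice. -->
  <logger name="com.datastax.driver.core" additivity="false">
    <appender-ref ref="STDOUT_NO_TRACE"/>
  </logger>

  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```

Note that this hides the stack trace for every exception logged by the driver, including the authentication case mentioned above, so the ERROR line itself is still worth watching.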


Apache Spark on k8s: securing RPC communication between driver and executors is not working

I have been trying a Spark 2.4 deployment on k8s and want to establish a secured RPC communication channel between the driver and the executors. I was using the following configuration parameters as part of spark-submit:
spark.authenticate true
spark.authenticate.secret good
spark.network.crypto.enabled true
spark.network.crypto.keyFactoryAlgorithm PBKDF2WithHmacSHA1
spark.network.crypto.saslFallback false
The driver and executors were not able to communicate on a secured channel and were throwing the following errors.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Unknown challenge message.
at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:109)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:181)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
Can someone guide me on this?
Disclaimer: I do not have a very deep understanding of the Spark implementation, so be careful when using the workaround described below.
AFAIK, Spark 2.4.0 does not support auth/encryption on k8s.
There is a ticket that has already been fixed and will likely be released in an upcoming Spark version: https://issues.apache.org/jira/browse/SPARK-26239
The problem is that Spark executors open a connection to the driver, and the configuration is only sent over that connection. The executor, however, creates the connection using the default config plus any system properties starting with "spark.".
For reference, here is the place where the executor opens the connection: https://github.com/apache/spark/blob/5fa4384/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L201
Theoretically, setting spark.executor.extraJavaOptions=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true ... should help, but the driver checks that no Spark parameters are set in extraJavaOptions.
There is a workaround, though (a little bit hacky): you can set spark.executorEnv.JAVA_TOOL_OPTIONS=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true .... Spark does not check this parameter, but the JVM reads this environment variable and adds the options to its system properties.
Also, instead of passing the secret via JAVA_TOOL_OPTIONS, I would recommend using spark.executorEnv._SPARK_AUTH_SECRET=<secret>.
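To make the mechanism a bit more concrete, here is a minimal sketch of what "system properties starting with spark." means on the JVM side. This is not Spark's code; the class and method names are hypothetical and only illustrate the prefix filtering described above.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical illustration, not taken from Spark.
final class SparkSystemProps {
    // Copy every property whose key starts with "spark." into a sorted map.
    static Map<String, String> collect(Properties props) {
        Map<String, String> result = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("spark.")) {
                result.put(key, props.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints whatever spark.* properties this JVM was started with.
        System.out.println(collect(System.getProperties()));
    }
}
```

If you start this with the environment variable JAVA_TOOL_OPTIONS set to -Dspark.authenticate=true, the JVM reports that it picked up the variable and the property should show up in the printed map, which is the same route the workaround relies on inside the executor container.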

getOrCreate deployment failing randomly

When attempting to call H2OContext.getOrCreate with a valid SparkContext, we keep seeing random failures to deploy:
17/04/21 17:21:32 ERROR TaskSchedulerImpl: Lost executor 0 on 172.17.0.4: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/21 17:21:38 ERROR LiveListenerBus: Listener ExecutorAddNotSupportedListener threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.listeners.ExecutorAddNotSupportedListener.onExecutorAdded(H2OSparkListener.scala:27)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:61)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1252)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
The H2OContext.getOrCreate causes the error:
Context.spark_session = SparkSession.builder.getOrCreate()
Context.h2o_context = H2OContext.getOrCreate(Context.spark_session)
Any thoughts from the H2O Crew?
This is a known behaviour of the Sparkling Water internal backend at the moment. To avoid it, the external Sparkling Water backend can be used. More information can be found here: https://github.com/h2oai/sparkling-water/blob/master/doc/backends.md
I'm currently working on a JIRA that should eliminate the behaviour above as well. It's a work in progress; you can track its status at https://0xdata.atlassian.net/browse/SW-369

KairosDB failed to discover other Cassandra nodes in ring with Hector client

I have a multi-node Cassandra cluster (2.2.6) and a separate KairosDB server (1.1.1-1). I configured KairosDB with two Cassandra seed nodes and set it to auto-discover the other Cassandra nodes in the ring.
After turning the KairosDB log level up to DEBUG, I see that only those two seed nodes are in the host pool (and working well). The Hector discovery process fails with an NPE, so in the end only these two seed nodes are used by KairosDB.
There might be a few solutions:
Add all nodes to the kairos properties, but that is harder to maintain.
Custom-build a new KairosDB binary with a later version of Hector (2.0.0), but I prefer to go with official releases if possible.
Do you know a way to get around this? Thanks.
08-04|18:54:57.755 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:50] - Node discovery running...
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:74] - using existing hosts [cassandra-seed1(172.16.109.43):9160, cassandra-seed2(172.16.108.51):9160]
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] ERROR [NodeDiscovery.java:105] - Discovery Service failed attempt to connect CassandraHost
java.lang.NullPointerException: null
at me.prettyprint.cassandra.connection.NodeDiscovery.discoverNodes(NodeDiscovery.java:79) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeDiscovery.doAddNodes(NodeDiscovery.java:52) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeAutoDiscoverService.doAddNodes(NodeAutoDiscoverService.java:45) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeAutoDiscoverService$QueryRing.run(NodeAutoDiscoverService.java:51) [hector-core-1.1-4.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_101]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_101]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_101]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_101]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_101]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_101]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_101]
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:62] - Node discovery run complete.

Accumulo's createtable command gets stuck and does not create a table

I was trying to create a table inside Accumulo using the createtable command and found out that it was getting stuck. I waited for around 20 mins before cancelling the createtable command.
createtable test_table
I have one master and 2 tablet servers, and found out that my master and one of the tablet servers had died. I could not telnet to port 9997 of that particular tablet server, and I could not even telnet to port 29999 (master.port.client in accumulo-site.xml). When I looked at the tserver logs of the dead server, I saw the following entries.
2016-05-10 02:12:07,456 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/recovery
2016-05-10 02:12:23,883 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.accumulo.fate.zookeeper.ZooCache$1.run(ZooCache.java:210)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.getChildren(ZooCache.java:221)
at org.apache.accumulo.core.client.impl.Tables.exists(Tables.java:142)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.tableExists(LargestFirstMemoryManager.java:149)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.getMemoryManagementActions(LargestFirstMemoryManager.java:175)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:408)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.access$400(TabletServerResourceManager.java:318)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework$2.run(TabletServerResourceManager.java:346)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,884 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables/!0/conf/table.classpath.context
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:117)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:103)
at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:99)
at org.apache.accumulo.tserver.constraints.ConstraintChecker.classLoaderChanged(ConstraintChecker.java:93)
at org.apache.accumulo.tserver.tablet.Tablet.checkConstraints(Tablet.java:1225)
at org.apache.accumulo.tserver.TabletServer$8.run(TabletServer.java:2848)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,887 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tservers/accumulo.tablet.2:9997
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:132)
at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:383)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Changing monitor log4j address to accumulo.master:4560
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Enabled log-forwarding
Even the master server's logs had the same stacktrace. My zookeeper is running.
At first, I thought it was a disk issue. Maybe there was no space. But that was not the case. I ran the fsck on the accumulo instance.volumes and it returned the HEALTHY status.
Does anyone know what exactly happened and if possible, how to avoid it?
EDIT : Even the tracer_accumulo.master.log had the same stacktrace.
ZooKeeper session expirations occur when a thread inside the ZooKeeper client does not get to run within the time needed to maintain the session (by default, 30s). The session is in-memory state shared between the ZooKeeper client and server. There is no single explanation for an expiration, but there are several common culprits:
JVM garbage collection pauses in the client. Accumulo should log a warning if it experienced a pause.
Lack of CPU time. If the host itself is overburdened, Accumulo might not have the cycles to run all of the tasks it needs to in a timely manner.
Lack of sockets/filehandles. Accumulo could be trying to connect to ZooKeeper but be unable to open new connections.
ZooKeeper might be rate-limiting connections as a denial-of-service prevention. Check the zookeeper logs for errors about dropping/denying new connections from a specific IP, and, if you see these errors, consider increasing maxClientCnxns in zoo.cfg.
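For the first two culprits, a rough way to see whether a JVM is being starved or paused is to sleep for a fixed interval and measure how much longer the sleep actually took. The sketch below only illustrates that idea; it is not how Accumulo or ZooKeeper detect pauses, and the class name, interval, and warning threshold are all made-up values.

```java
// Hypothetical pause detector: a large overshoot on a short sleep suggests
// a GC pause or a host that is too busy to schedule the thread on time.
final class PauseMonitor {
    // How much longer than requested the sleep took; never negative.
    static long overshootMillis(long requestedMillis, long elapsedMillis) {
        return Math.max(0L, elapsedMillis - requestedMillis);
    }

    // Sleep once for the given interval and return the measured overshoot.
    static long sampleOnce(long intervalMillis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and fall through with a short sample.
            Thread.currentThread().interrupt();
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return overshootMillis(intervalMillis, elapsedMillis);
    }

    public static void main(String[] args) {
        long warnThresholdMillis = 1_000L; // assumed threshold, tune as needed
        for (int i = 0; i < 5; i++) {
            long overshoot = sampleOnce(200L);
            if (overshoot > warnThresholdMillis) {
                System.err.println("Possible JVM or host pause of about " + overshoot + " ms");
            }
        }
        System.out.println("sampling finished");
    }
}
```

An overshoot that approaches the session timeout would be consistent with the expirations in the logs above, but it does not tell you whether GC or CPU starvation caused it, so the GC logs and host metrics are still needed.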

Does Pelops and Hector support IPv6?

I am using Pelops to retrieve data from a Cassandra cluster whose servers all run on IPv6.
While running this program I get the following error:
Exception in thread "main" java.lang.RuntimeException: exception while checking if MBean is registered, com.scale7.cassandra.pelops.pool:type=PooledNode-testkeyspace-2001:1c11:90:111:2:6:8:10
at org.scale7.cassandra.pelops.JmxMBeanManager.isRegistered(JmxMBeanManager.java:58)
at org.scale7.cassandra.pelops.pool.PooledNode.<init>(PooledNode.java:66)
at org.scale7.cassandra.pelops.pool.CommonsBackedPool.addNode(CommonsBackedPool.java:415)
at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:137)
at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:88)
at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:76)
at org.scale7.cassandra.pelops.Pelops.addPool(Pelops.java:48)
at com.opera.osp.client.CassandraClient.<init>(Unknown Source)
at com.opera.osp.validation.OSPDataValidator.main(Unknown Source)
Caused by: javax.management.MalformedObjectNameException: Invalid character ':' in value part of property
at javax.management.ObjectName.construct(ObjectName.java:602)
at javax.management.ObjectName.<init>(ObjectName.java:1403)
at org.scale7.cassandra.pelops.JmxMBeanManager.isRegistered(JmxMBeanManager.java:54)
... 8 more
Does Pelops support IPv6? If not, I am planning to migrate to Hector, but does Hector support it either?
It looks like an issue unrelated to IPv6 itself. If the IPv6 address is used in the MBean's name, it has to be escaped (for example, by replacing each ":" with "_").
I would assume they otherwise support IPv6 because Java does and they must be using Java's network API.
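A small sketch of the escaping idea, using only the JDK's javax.management classes. The helper class and method names are hypothetical; the domain and address are copied from the error message above. It shows two options: replacing the colons, or quoting the whole value with ObjectName.quote.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of Pelops or Hector.
final class MBeanNameDemo {
    // Replace ':' so the address is legal in an unquoted ObjectName value.
    static String sanitize(String address) {
        return address.replace(':', '_');
    }

    // True if the JDK accepts the string as an ObjectName.
    static boolean isValid(String name) {
        try {
            new ObjectName(name);
            return true;
        } catch (MalformedObjectNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String domain = "com.scale7.cassandra.pelops.pool";
        String node = "PooledNode-testkeyspace-";
        String address = "2001:1c11:90:111:2:6:8:10";

        // Raw IPv6 address in an unquoted value, as in the stack trace.
        System.out.println(isValid(domain + ":type=" + node + address));
        // Colons replaced with underscores.
        System.out.println(isValid(domain + ":type=" + node + sanitize(address)));
        // Whole value quoted, which keeps the original address readable.
        System.out.println(isValid(domain + ":type=" + ObjectName.quote(node + address)));
    }
}
```

In the trace above the name is built inside Pelops' JmxMBeanManager, so applying either fix would presumably mean patching Pelops or giving the node a name without colons, if the client allows that.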
