Cassandra DataStax OpsCenter "NO DATA" on some graphs

I'm running DataStax OpsCenter 5.2.0 on a single-node test cluster, installed from the DataStax Amazon AMI version 2.6.3 with Cassandra Community 2.2.0-1.
OpsCenter doesn't report any errors (all agents connected), yet some graphs show NO DATA even though I know there have been plenty of requests, and on some there is nothing at all.
Others work just fine, like OS: Load, Storage Capacity, and OS: Disk Utilization.
What could be the reason for this, and how do I fix it?
EDIT:
The OpsCenter logs seem to be fine. In the agent.log file, I found the following errors (dozens of times):
ERROR [jmx-metrics-2] 2015-09-21 06:50:30,910 Error getting CF metrics
java.lang.UnsupportedOperationException: nth not supported on this type: PersistentArrayMap
at clojure.lang.RT.nthFrom(RT.java:857)
at clojure.lang.RT.nth(RT.java:807)
at opsagent.rollup$process_metric_map.invoke(rollup.clj:252)
at opsagent.metrics.jmx$cf_metric_helper.invoke(jmx.clj:96)
at opsagent.metrics.jmx$start_pool$fn__15320.invoke(jmx.clj:159)
at clojure.lang.AFn.run(AFn.java:24)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
ERROR [jmx-metrics-4] 2015-09-21 06:50:38,524 Error getting general metrics
java.lang.UnsupportedOperationException: nth not supported on this type: PersistentHashMap
at clojure.lang.RT.nthFrom(RT.java:857)
at clojure.lang.RT.nth(RT.java:807)
at opsagent.rollup$process_metric_map.invoke(rollup.clj:252)
at opsagent.metrics.jmx$generic_metric_helper.invoke(jmx.clj:73)
at opsagent.metrics.jmx$start_pool$fn__15334$fn__15335.invoke(jmx.clj:171)
at opsagent.metrics.jmx$start_pool$fn__15334.invoke(jmx.clj:170)
at clojure.lang.AFn.run(AFn.java:24)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
BTW, in the DataStax agent configuration file (address.yaml) I have only the stomp_interface parameter set, pointing to my node's IP.

NO DATA was in this case caused by a bug in the DataStax agent. I've included the related error in my question. The same error is also mentioned here: Opsagent UnsupportedOperationException with PersistentHashMap
After upgrading DataStax OpsCenter from 5.2.0 to 5.2.1 and upgrading the agent to the same version, the problem went away.
Thank you @phact for your help!

I had a similar error before, and it was due to a misconfigured address.yaml.
In my case the issue was that I had put a bare IP address in the hosts setting; it was fixed by using array syntax instead. Without the array, the agent apparently parsed the value into the wrong data type, which then raised the UnsupportedOperationException. Try adding your local interface IP (or whatever IP your Cassandra node is listening on) to the hosts setting in address.yaml like so:
hosts: ["<local-node-ip>"]
Maybe also try adding the following settings to address.yaml, just to be sure the agent picks up the correct parameters:
rpc_address: <local-node-ip>
agent_rpc_interface: <local-node-ip>
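Putting it together, a minimal address.yaml for this kind of single-node setup might look like the sketch below. The placeholder IPs are assumptions; substitute whatever addresses your OpsCenter server and Cassandra node actually use:
stomp_interface: <opscenter-ip>    # machine running OpsCenter (the same host here)
hosts: ["<local-node-ip>"]         # note the array syntax
rpc_address: <local-node-ip>
agent_rpc_interface: <local-node-ip>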

Related

NoSuchMethodError: org.apache.cassandra.db.ColumnFamilyStore.getOverlappingSSTables

I have upgraded one of my cluster nodes from 2.2.19 to 3.11.13, but I keep getting the error below in the system logs. I'm using TimeWindowCompactionStrategy-3.7.jar.
Please let me know how I can fix this error.
ERROR [CompactionExecutor:2338] 2022-09-12 14:40:41,310 CassandraDaemon.java:244 - Exception in thread Thread[CompactionExecutor:2338,1,main]
java.lang.NoSuchMethodError: org.apache.cassandra.db.ColumnFamilyStore.getOverlappingSSTables(Lorg/apache/cassandra/db/lifecycle/SSTableSet;Ljava/lang/Iterable;)Ljava/util/Collection;
at com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy.getNextBackgroundSSTables(TimeWindowCompactionStrategy.java:110)
at com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy.getNextBackgroundTask(TimeWindowCompactionStrategy.java:79)
at org.apache.cassandra.db.compaction.CompactionStrategyManager.getNextBackgroundTask(CompactionStrategyManager.java:154)
at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
at java.lang.Thread.run(Thread.java:750)
TimeWindowCompactionStrategy is bundled with Apache Cassandra 3.11.13, so you shouldn't need to include the JAR for it. Remove the JAR file and restart the node(s).
Edit
OK, after a quick conversation with Jeff, he has two suggestions:
The 3.7 JAR won't be compatible with 3.11, so issue the ALTER TABLE statement on the 3.11 node, which will use the TWCS version bundled with 3.11. It won't propagate to the 2.2 hosts (because schema changes won't cross major versions).
You'll be in a schema disagreement state, but that should be OK until the upgrade is complete. Give that a try in a lower environment first, just to make sure it works.
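For example, a statement along these lines (keyspace, table, and window options are placeholders; keep whatever TWCS options the table already uses):
ALTER TABLE my_ks.my_table
WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                   'compaction_window_unit': 'DAYS',
                   'compaction_window_size': 1};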
The other option is to take the TWCS source from 3.11, repackage it under the com.jeffjirsa package name, and just use that instead.
Edit
Protocol exception with client networking: org.apache.cassandra.transport.ProtocolException: Invalid or unsupported protocol version (4); supported versions are (3/v3, 4/v4, 5/v5-beta)
Is the error due to the mixed versions in the cluster?
Yes! I've seen that happen before. You can actually force a protocol version in the driver's connection settings.
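For example, with the DataStax Java driver 3.x you can pin the protocol to a version that every node in the mixed cluster supports (v3 here is an assumption, and the contact point is a placeholder):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ProtocolVersion;

Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")
        .withProtocolVersion(ProtocolVersion.V3) // force a fixed version while the cluster runs mixed
        .build();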
Best of luck!

Spark 3.2 on Kubernetes keeps throwing okhttp3/okio EOFException

I'm using a Spark 3.2.1 image that was built from the official distribution via `docker-image-tool.sh`, on a Kubernetes 1.18 cluster. Everything works fine, except for this error message every 90 seconds:
WARN WatcherWebSocketListener: Exec Failure
java.io.EOFException
at okio.RealBufferedSource.require(RealBufferedSource.java:61)
at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This error message does not affect the application, but it's really annoying, especially for Jupyter users, and the lack of detail makes it very hard to debug.
It appears with any submission variant (spark-submit, pyspark, spark-shell), regardless of whether dynamic allocation is enabled or disabled.
I've found traces of it on the internet, but all occurrences were with older versions of Spark and were resolved by using a "newer" version of fabric8 (4.x). Spark 3.2.1 already uses fabric8 5.4.1.
I wonder if anyone else still sees this error in Spark 3.x and has found a resolution.
Thanks.
Update:
This seems to be related to the Kubernetes cluster itself; after migrating to a new cluster, the error was gone.

Flink-Cassandra connector throws exception (flink-connector-cassandra_2.11-1.10.0)

I am trying to upgrade from Flink 1.7.2 to Flink 1.10 and I am having a problem with the Cassandra connector. Every time I start a job that uses it, the following exception is thrown:
com.datastax.driver.core.exceptions.TransportException: [/xx.xx.xx.xx] Error writing
at com.datastax.driver.core.Connection$10.operationComplete(Connection.java:550)
at com.datastax.driver.core.Connection$10.operationComplete(Connection.java:534)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyLateListener(DefaultPromise.java:621)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:138)
at com.datastax.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
at com.datastax.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
at com.datastax.driver.core.Connection$Flusher.run(Connection.java:870)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.shaded.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Direct buffer memory
at com.datastax.shaded.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107)
at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:643)
The following message was also printed when the job was run locally (not in YARN):
13:57:54,490 ERROR com.datastax.shaded.netty.util.ResourceLeakDetector - LEAK: You are creating too many HashedWheelTimer instances. HashedWheelTimer is a shared resource that must be reused across the JVM,so that only a few instances are created.
All jobs that do not use the Cassandra connector work properly.
Can someone help?
UPDATE: The bug is still reproducible and I think this is the reason: https://issues.apache.org/jira/browse/FLINK-17493.
I had an old configuration (from Flink 1.7) with classloader.parent-first-patterns.additional: com.datastax. set, and my Cassandra-Flink connector JAR was in the flink/lib folder (this was done because of other shaded-Netty problems I'd had with the connector). The migration to Flink 1.10 then hit the problem above. After removing the classloader.parent-first-patterns.additional: com.datastax. setting, bundling flink-connector-cassandra_2.12-1.10.0.jar into my job JAR, and removing it from /usr/lib/flink/lib/, the problem was no longer reproducible.
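For reference, bundling the connector into the job JAR rather than flink/lib just means declaring it as a normal (non-provided) dependency, e.g. in Maven:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-cassandra_2.12</artifactId>
    <version>1.10.0</version>
</dependency>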

Error while saving the Spark RDD to Cassandra

We are trying to save an RDD with close to 4 billion rows to Cassandra. Some of the data gets persisted, but for some partitions we see the errors below in the Spark logs.
We have already set these two properties for the Cassandra connector. Is there some other optimization we need to do? Also, what are the recommended reader settings? We have left them at their defaults.
spark.cassandra.output.batch.size.rows=1
spark.cassandra.output.concurrent.writes=1
We are running Spark 1.1.0 and spark-cassandra-connector-java_2.10 v2.1.0.
15/01/08 05:32:44 ERROR QueryExecutor: Failed to execute: com.datastax.driver.core.BoundStatement@3f480b4e
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.87.33.133:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Thanks
Ankur
I've seen something similar in my four-node cluster. It seemed that if I specified EVERY Cassandra node name in the Spark settings, it worked; if I specified only the seeds (of the four, two were seeds), I got the exact same issue. I haven't followed up on it, since specifying all four gets the job done (but I intend to at some point). I'm using hostnames, not IPs, for the seed values, and hostnames in the Spark Cassandra settings. I did hear it could be due to some Akka DNS issue. Maybe try using IP addresses through and through, or specifying all hosts; the latter has been working flawlessly for me.
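For example, every node can be listed as a contact point in the connector settings (hostnames here are placeholders):
spark.cassandra.connection.host=cass1,cass2,cass3,cass4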
I realized I was running the application with spark.cassandra.output.concurrent.writes=2. I changed it to 1 and there were no more exceptions. The exceptions occurred because Spark was producing data much faster than our Cassandra cluster could write it, so changing the setting to 1 worked for us.
Thanks!!

Neo4J InvalidEpochException when writing to a multi-node cluster

I have a Neo4j Enterprise cluster with 6 nodes (1 master and 5 slaves), hosted in separate Linux (CentOS 6.4) VMs on Azure, that I am testing an application against. I have an object that manages connections across all 6 nodes using a simple round-robin technique. I noticed that when writing to slave nodes, the following error appears several times in the messages.log file:
ERROR [o.n.k.h.c.m.MasterServer]: Could not finish off dead channel
org.neo4j.kernel.ha.com.master.InvalidEpochException: Invalid epoch 282880438682249, correct epoch is 282880443056723
at org.neo4j.kernel.ha.com.master.MasterImpl.assertCorrectEpoch(MasterImpl.java:218) ~[neo4j-ha-2.1.2.jar:2.1.2]
at org.neo4j.kernel.ha.com.master.MasterImpl.finishTransaction(MasterImpl.java:363) ~[neo4j-ha-2.1.2.jar:2.1.2]
at org.neo4j.kernel.ha.com.master.MasterServer.finishOffChannel(MasterServer.java:70) ~[neo4j-ha-2.1.2.jar:2.1.2]
at org.neo4j.com.Server.tryToFinishOffChannel(Server.java:411) ~[neo4j-com-2.1.2.jar:2.1.2]
at org.neo4j.com.Server$4.run(Server.java:589) [neo4j-com-2.1.2.jar:2.1.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_60]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_60]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
I am at a loss as to why this error occurs. When it does, the write fails. The servers synchronize their time via NTP. The application writing to Neo4j is a .NET application using the Neo4jClient library. Thank you for any guidance you can provide.
Amir.
The InvalidEpochException can occur when a master switch happens while a slave is in the middle of committing a transaction. The main reason for unexpected master switches is lengthy GC pauses, longer than the cluster timeout settings.
So you need to analyze and optimize GC behaviour, or tweak your cluster timeout settings.
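As a first step, you could enable GC logging to see whether pause times exceed your cluster timeouts. A sketch for Neo4j 2.1's conf/neo4j-wrapper.conf follows (standard HotSpot flags for Java 7; the log path is an assumption):
wrapper.java.additional=-XX:+PrintGCDetails
wrapper.java.additional=-XX:+PrintGCDateStamps
wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
wrapper.java.additional=-Xloggc:data/log/gc.log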
