Issue with running KairosDB on Cassandra

I'm running KairosDB (a time series database) on top of Cassandra. I'm experiencing a recurring debug message from Cassandra/KairosDB that comes up all the time. They are able to connect to each other, but I fear something is not configured correctly. Here is a snippet of the output and how it's presented over and over:
08:06:00.000 [QuartzScheduler_QuartzSchedulerThread] DEBUG
[QuartzSchedulerThread.java:268] - batch acquisition of 0 triggers
08:06:00.000 [QuartzScheduler_Worker-3] DEBUG [JobRunShell.java:212] -
Calling execute on job
DEFAULT.org.kairosdb.core.reporting.MetricReporterService
08:06:00.001 [QuartzScheduler_Worker-3] DEBUG
[MetricReporterService.java:93] - Reporting metrics
08:06:09.335
[Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1]
DEBUG [CassandraHostRetryService.java:123] - Retry service fired... nothing
to do.
Has anyone experienced this before and does anyone know what the issue is?
Thanks in advance!

There is no problem with these messages; they are perfectly normal.
KairosDB has some jobs, triggered by Quartz, that report metrics, and when using Cassandra (with the Hector client) it periodically checks for hosts that have been marked down or are unresponsive.
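If the DEBUG chatter is just noise for you, you can raise the log level for those two components. Below is a minimal programmatic sketch, assuming Logback is the SLF4J backend (as in a stock KairosDB install); the logger names are taken from the package names visible in the messages above, and you can achieve the same thing declaratively with logger entries in KairosDB's logback.xml.

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class QuietSchedulerLogging {
    public static void main(String[] args) {
        // Quartz scheduler classes (QuartzSchedulerThread, JobRunShell) log under org.quartz
        ((Logger) LoggerFactory.getLogger("org.quartz")).setLevel(Level.INFO);
        // Hector's CassandraHostRetryService ("Retry service fired... nothing to do.")
        ((Logger) LoggerFactory.getLogger("me.prettyprint.cassandra")).setLevel(Level.INFO);
    }
}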

Related

AppInsights - Monitor for Hung Processes

We are looking at implementing AppInsights for our non-web application. One of the things that we want to monitor for is processes that may be "hung" for more than N seconds or minutes. I have been unable to find something built in that does this. The closest thing I have seen or thought of would be to log two custom events for the start and end of a process, and then have an alert on a custom log query that finds events with no matching "end" event after N minutes.
Is there another way to monitor for hung processes using AppInsights that I am not seeing? Thanks for any help.
If you choose to use Application Insights, here is a suggestion just for your reference (but if you have another, better solution, feel free to ignore this):
As per this post, you can leverage the heartbeat feature; details below:
If this application runs for more than several seconds, you can leverage the heartbeat feature - it sends a metric every N minutes/seconds (configurable), and the absence of that metric will indicate that the application is no longer actively running. However, if the Application Insights thread survives, then the heartbeat will still be reported.
You can rely on the presence/absence of telemetry from this app in general, as well as on a couple of custom events as you outlined above - Azure Monitor allows you to set an alert on an analytics query, so you'll be able to craft a query that returns nothing in case of application issues and set an alert on a 0 count returned by such a query.
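To make the start/end custom-event idea from the question concrete, here is a rough sketch assuming the Application Insights Java SDK; the event names ProcessStarted/ProcessCompleted and the runTheActualWork() method are placeholders for illustration only.

import com.microsoft.applicationinsights.TelemetryClient;

public class HungProcessMarkers {
    private static final TelemetryClient telemetry = new TelemetryClient();

    public static void main(String[] args) {
        telemetry.trackEvent("ProcessStarted");        // custom "start" event
        try {
            runTheActualWork();                        // the long-running job being monitored
        } finally {
            telemetry.trackEvent("ProcessCompleted");  // custom "end" event
            telemetry.flush();                         // push telemetry before the process exits
        }
    }

    private static void runTheActualWork() { /* ... */ }
}

An Azure Monitor alert on an analytics query over custom events can then fire when a ProcessStarted event older than N minutes has no matching ProcessCompleted event.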

"Request timed out" is not logged on the server side in Cassandra

I have set the server timeout in Cassandra to 60 seconds and the client timeout in the cpp driver to 120 seconds.
I use a batch query which has 18K operations. I get the Request timed out error in the cpp driver logs, but in the Cassandra server logs there is no TRACE available, in spite of enabling ALL logs in Cassandra's logback.xml.
So how can I confirm whether the timeout is thrown from the server side or the client side in Cassandra?
BATCH is not intended to work that way. It's designed to apply 6 or 7 mutations to different tables atomically. You're trying to use it like its RDBMS counterpart (Cassandra just doesn't work that way). The BATCH timeout is designed to protect the node/cluster from crashing due to how expensive that query is for the coordinator.
In the system.log, you should see warnings/failures about the sheer size of your BATCH. If you've modified those thresholds and don't see that, you should see a warning about a timeout threshold being exceeded (I think BATCH gets its own timeout in 3.0).
If all else fails, run your BATCH statement (or part of it) in cqlsh with tracing on, and you'll see precisely why this is a bad idea (server side).
Also, the default query timeouts are there to protect your cluster. You really shouldn’t need to alter those. You should change your query/model or approach before looking at adjusting the timeout.
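To make that concrete, the usual alternative to an 18K-statement BATCH is to prepare the statement once and issue the mutations individually and asynchronously, throttling how many are in flight. The question uses the C++ driver, but the idea is the same; this sketch assumes the DataStax Java driver 3.x and a hypothetical ks.tbl table:

import com.datastax.driver.core.*;
import java.util.ArrayList;
import java.util.List;

public class UnbatchedWrites {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Prepare once, bind per row -- instead of one giant multi-partition BATCH
            PreparedStatement insert =
                    session.prepare("INSERT INTO ks.tbl (id, value) VALUES (?, ?)");

            List<ResultSetFuture> inFlight = new ArrayList<>();
            for (int i = 0; i < 18_000; i++) {
                inFlight.add(session.executeAsync(insert.bind(i, "value-" + i)));
                if (inFlight.size() >= 256) {          // crude throttle on in-flight writes
                    inFlight.forEach(ResultSetFuture::getUninterruptibly);
                    inFlight.clear();
                }
            }
            inFlight.forEach(ResultSetFuture::getUninterruptibly);
        }
    }
}

Small unlogged batches scoped to a single partition are still fine; it is the large multi-partition BATCH that concentrates all the work on one coordinator.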

Cassandra Nodes Going Down

I have a 3-node Cassandra cluster (replication factor set to 2) with Solr installed; each node has RHEL, 32 GB RAM, a 1 TB HDD, and DSE 4.8.3. There are lots of writes happening on my nodes, and my web application also reads from them.
I have observed that all the nodes go down every 3-4 days. I have to restart every node, and then they function quite well until the next 3-4 days, when the same problem repeats. I checked the server logs, but they do not show any error even when the servers go down. I am unable to figure out why this is happening.
In my application, sometimes when I connect to the nodes through the C# Cassandra driver, I get the following error:
Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: 'node-ip':9042) at Cassandra.Tasks.TaskHelper.WaitToComplete(Task task, Int32 timeout) at Cassandra.Tasks.TaskHelper.WaitToComplete[T](Task`1 task, Int32 timeout) at Cassandra.ControlConnection.Init() at Cassandra.Cluster.Init()
But when I check OpsCenter, none of the nodes are down; all nodes show a perfectly fine status. Could this be a problem with the driver? Earlier I was using Cassandra C# driver version 2.5.0 installed from NuGet, but I have since updated it to version 3.0.3 and the error still persists.
Any help on this would be appreciated. Thanks in advance.
If you haven't done so already, you may want to look at setting your logging levels to DEBUG by running the following on all your nodes: nodetool -h 192.168.XXX.XXX setlogginglevel org.apache.cassandra DEBUG
Your first issue is most likely an OutOfMemory Exception.
For your second issue, the problem is most likely that you have really long GC pauses. Tailing /var/log/cassandra/debug.log or /var/log/cassandra/system.log may give you a hint but typically doesn't reveal the problem unless you are meticulously looking at the timestamps. The best way to troubleshoot this is to ensure you have GC logging enabled in your jvm.options config and then tail your gc logs taking note of the pause times:
grep 'Total time for which application threads were stopped:' /var/log/cassandra/gc.log.1 | less
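For reference, the "Total time for which application threads were stopped" lines grepped above are produced by GC logging flags along these lines in jvm.options (a sketch for a JDK 8 style setup; the log path and rotation sizes are only examples):

-Xloggc:/var/log/cassandra/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M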
The "Unexpected exception during request; channel = [....] java.io.IOException: Error while read (....): Connection reset by peer" error is typically caused by inter-node timeouts, i.e. the coordinator times out waiting for a response from another node and sends a TCP RST packet to close the connection.

MQTT Paho using Spring Integration stops processing messages on a topic under certain load

I am using Spring Integration with mqtt-paho version 4.0.4 for receiving MQTT messages on a specified topic.
When the application is receiving a huge load, I found that it sometimes drops the connection to IMA (MQTT); this happened three times over a span of 1 lakh (100,000) records.
But it regains connectivity and starts consuming the messages received thereafter; there was no issue with IMA re-connectivity.
There is another issue which I faced during this testing.
When there is continuous load on the application, at some point it stops receiving messages, and we can see one message flashed on screen, i.e.:
May 04, 2015 2:45:29 PM org.eclipse.paho.client.mqttv3.internal.ClientState checkForActivity
SEVERE: gvjIpONtSpP: Timed out as no activity, keepAlive=60,000 lastOutboundActivity=1,430,730,869,017 lastInboundActivity=1,430,730,929,151
After this, we can see that no messages are received by the application even if continuous load is pushed through the utility.
I observed this behavior three times: at around 40K, at around 90K, and at around 145K records. There is no consistent point or figure at which the application actually stops receiving messages.
Please let me know if anybody has faced and solved this before.
We had the same issue during MQTT Paho client performance/durability testing, before moving to production. The issue was on the broker side; after adjusting the settings, the IMA broker was able to consume millions of messages with no rejections.
Please look into the max buffer parameter in the IMA configuration web console, and into the overlimit behavior policy (what to do with messages published over the specified threshold): reject, rollover, etc.
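The fix in our case was entirely broker-side, but if you also want to experiment with the client keep-alive that appears in the SEVERE message (keepAlive=60,000 ms), here is a minimal plain-Paho sketch; the broker URL and client id are placeholders, and with Spring Integration you would feed the same MqttConnectOptions through its Paho client factory:

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttException;

public class KeepAliveTuning {
    public static void main(String[] args) throws MqttException {
        MqttConnectOptions options = new MqttConnectOptions();
        options.setKeepAliveInterval(30);   // seconds; the default is 60, as in the log above
        options.setConnectionTimeout(30);   // seconds to wait for the broker on connect
        options.setCleanSession(false);     // keep subscriptions across reconnects

        MqttClient client = new MqttClient("tcp://broker.example.com:1883", "load-test-client");
        client.connect(options);
        // ... set a callback and subscribe as usual; the point here is only the keep-alive tuning
    }
}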

Azure Eventhub Apache Storm issue

I followed this article to try Event Hub with Apache Storm, but when I run the Storm topology it receives events for a minute and then stops receiving. So I restarted my program, and then it received the remaining messages. Every time I run the program, after a minute it can no longer receive from Event Hub. Please help me with the possible causes of this issue.
Should I change any configuration in Storm or Zookeeper?
The above jar contains a fix for a known issue in the QPID JMS client, which is used by the Event Hub spout implementation. When the service sends an empty frame (heartbeat to keep connection alive), a decoding error occurs in the client and that causes the client to stop processing commands. Details of this issue can be found here: https://issues.apache.org/jira/browse/QPID-6378
