I need to report the metrics below for a Cassandra cluster at the end of the day:
Cluster Load (Avg.)
No. of read requests
No. of write requests
Read Latency
Write Latency
Long GC Pauses
No. of Connections
These have to be cluster-wise, not node-wise.
Currently we are preparing the report through OpsCenter, which is a very manual process. I am planning to automate this task by writing a script.
As I am new to Cassandra, I would like suggestions on where to begin from experienced folks here.
Can this all be done using nodetool?
Thanks,
MT
There are several possibilities here:
If you use OpsCenter, you can use the OpsCenter Metrics API to retrieve the needed data. The API allows you to ask for data in a given time range (start, end, and step parameters), and then you can do any calculation on that data... The only catch is that you can't mix histogram data with gauges, etc.
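As a sketch of the first option: the Metrics API is plain HTTP, so a reporting script can simply build the request URL with the needed time range. The host, port, cluster name, metric key, and the `node_aggregation` parameter below are illustrative assumptions; check your OpsCenter version's API documentation for the exact endpoint and field names.

```python
from urllib.parse import urlencode

def build_metrics_url(host, cluster_id, metric, start, end, step):
    """Build an OpsCenter metrics request URL.

    The endpoint path and parameter names here are hypothetical
    examples; verify them against your OpsCenter version's API docs.
    """
    params = urlencode({
        "metrics": metric,       # e.g. "write-ops" (name is an assumption)
        "start": start,          # Unix timestamp, range start
        "end": end,              # Unix timestamp, range end
        "step": step,            # resolution in seconds
        "node_aggregation": 1,   # ask the server to aggregate across nodes
    })
    return f"http://{host}:8888/{cluster_id}/new-metrics?{params}"

url = build_metrics_url("opscenter-host", "Test_Cluster", "write-ops",
                        1700000000, 1700086400, 60)
print(url)
```

A script can then fetch each metric for the day's window and format the report from the JSON responses.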
You can export DSE Metrics Collector data via Prometheus and then work with that data. There is a predefined config file for Prometheus, plus Grafana dashboards.
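For the second option, once the metrics are in Prometheus, a script can query its HTTP API (`GET /api/v1/query`) and aggregate the per-node samples itself. A minimal sketch, using a hard-coded response in the shape Prometheus returns (the instance labels and values are made up):

```python
import json

# Sample response in the shape returned by Prometheus's HTTP API
# (GET /api/v1/query); instance names and values are illustrative.
sample = json.loads("""
{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"instance":"node1:9103"},"value":[1700000000,"120.0"]},
  {"metric":{"instance":"node2:9103"},"value":[1700000000,"80.0"]},
  {"metric":{"instance":"node3:9103"},"value":[1700000000,"100.0"]}
]}}
""")

# Aggregate the per-node samples into one cluster-wide number.
cluster_total = sum(float(r["value"][1]) for r in sample["data"]["result"])
print(cluster_total)  # 300.0
```

In practice you would let PromQL do the aggregation (e.g. a `sum(...)` over the per-node series) and just read the single result back.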
Related
So I have a use case where I will stream about 1,000 records per minute from Kafka. I just need to dump these records in raw form into a NoSQL DB, or something like a data lake for that matter.
I ran this through two approaches
Approach 1
——————————
Create Kafka consumers in Java and run them as three different containers in Kubernetes. Since all the containers are in the same Kafka consumer group, they all contribute towards reading from the same Kafka topic and dump the data into the data lake. This works pretty quickly for the volume of workload I have.
Approach 2
——————————-
I then created a Spark cluster and used the same Java logic to read from Kafka and dump the data into the data lake.
Observations
———————————-
The performance of the Kubernetes approach was, if not better, at least equal to that of a Spark job running in cluster mode.
So my question is: what is the real use case for Spark over Kubernetes the way I am using it, or even for Spark on Kubernetes?
Is Spark only going to rise and shine at much, much heavier workloads, say on the order of 50,000 records per minute, or in cases where some real-time processing needs to be done on the data before dumping it to the sink?
Spark has more cost associated with it, so I need to make sure I use it only if it would scale better than the Kubernetes solution.
If your case is only to archive/snapshot/dump records, I would recommend you look into Kafka Connect.
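As an illustration, a sink connector is just a JSON config posted to the Connect REST API; the sketch below assumes the Confluent S3 sink connector is installed, and the topic, bucket, and region names are made up:

```json
{
  "name": "raw-records-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "3",
    "topics": "raw-records",
    "s3.bucket.name": "my-data-lake",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
```

With `tasks.max=3` you get roughly the same parallelism as the three Kubernetes consumers, without writing any consumer code.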
If you need to process the records you stream, e.g. aggregate or join streams, then Spark comes into the game. For this case you may also look into Kafka Streams.
Each of these frameworks has its own trade-offs and performance overheads, but in any case you save a lot of development effort by using tools made for the job rather than developing your own consumers. These frameworks also already support most failure handling, scaling, and configurable delivery semantics, and they have enough config options to tune the behaviour for most cases you can imagine. Just choose the available integration and you're good to go! And of course, beware of open source bugs ;) .
Hope it helps.
Running Spark inside Kubernetes is only recommended when you have a lot of expertise doing it: since Kubernetes doesn't know it's hosting Spark, and Spark doesn't know it's running inside Kubernetes, you will need to double-check every feature you decide to run.
For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools, and scheduling features, plus the huge community support, add up well in the long run.
Spark is an open-source, scalable, massively parallel, in-memory execution engine for analytics applications, so it will really spark when your load becomes more processing-intensive. It simply doesn't have much room to rise and shine if you are only dumping data, so keep it simple.
I'm going to set up monitoring of a Spark application via $SPARK_HOME/conf/metrics.properties, and have decided to use Graphite.
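For reference, a Graphite sink in metrics.properties looks roughly like this (the host, port, and prefix below are placeholders to fill in for your environment):

```properties
# Send all metrics to a Graphite carbon receiver every 10 seconds
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite-host
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark

# Optionally also expose JVM metrics per component
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```

The `period` and the set of enabled sources directly determine how many data points Graphite receives, which feeds into the sizing question below.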
Is there any way to estimate the database size of Graphite especially for monitoring Spark application?
Regardless of what you are monitoring, Graphite has its own configuration for retention and rollup of metrics. It stores one file (a whisper file) per metric, and you can use this calculator to estimate how much disk space it will take: https://m30m.github.io/whisper-calculator/
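The same estimate the calculator performs can be done directly from the whisper file format (16-byte file header, 12 bytes per archive header, 12 bytes per data point); a small sketch:

```python
def whisper_file_size(archives):
    """Estimate the on-disk size of one whisper file in bytes.

    `archives` is a list of (seconds_per_point, retention_seconds)
    tuples, mirroring Graphite's retention config. The constants follow
    the whisper format: 16-byte file header, 12 bytes per archive
    header, 12 bytes per data point.
    """
    header = 16 + 12 * len(archives)
    points = sum(ret // spp for spp, ret in archives)
    return header + 12 * points

# e.g. retentions = 10s:1d,60s:30d  -> one file of ~600 KB
size = whisper_file_size([(10, 86400), (60, 30 * 86400)])
print(size)
```

Multiply the per-file size by the number of distinct metric series your Spark application emits to get the total database size.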
I have to benchmark Spark Streaming processing. My process pulls messages from Kafka, processes them, and loads them into Elasticsearch. The upstream generates 100k records per second. I would like to calculate how many messages are processed per second and the latency. Are there any tools available to monitor this, or is there a process to calculate it?
The Spark UI can help you, providing the details you need.
By default, the Spark UI is available at http://<driver-node>:4040 in a web browser (for a single SparkContext).
For help, you can use this link: http://spark.apache.org/docs/latest/monitoring.html
Beyond the Spark UI, which is useful for determining the rate at which your data is processed, you can also use third-party tools like spark-perf to load-test your cluster and obtain benchmark data that way.
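If you only need the headline numbers, they can also be computed from batch timestamps you log yourself; a sketch with made-up batch figures:

```python
def throughput_and_latency(batches):
    """Compute records/second and average processing latency from
    (record_count, start_ts, end_ts) tuples, similar to what the
    Spark UI's streaming tab reports. Timestamps are in seconds.
    """
    total = sum(n for n, _, _ in batches)
    wall = max(e for _, _, e in batches) - min(s for _, s, _ in batches)
    # average per-batch processing time, weighted by record count
    avg_latency = sum(n * (e - s) for n, s, e in batches) / total
    return total / wall, avg_latency

# three hypothetical micro-batches of 100k records each
rate, latency = throughput_and_latency([
    (100_000, 0.0, 0.8),
    (100_000, 1.0, 1.9),
    (100_000, 2.0, 2.7),
])
print(rate, latency)
```

If the sustained rate stays at or above the upstream's 100k records/second and latency is stable, the pipeline is keeping up.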
Maybe you should try Yahoo's streaming-benchmarks; I found that Databricks used that tool to benchmark Spark Streaming against Flink.
https://github.com/yahoo/streaming-benchmarks
https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html
How can I collect data from all nodes within a cluster from a single node in Cassandra?
Does JMX provide aggregated values on a single node for all nodes present in the same cluster?
Yes, for a Cassandra cluster you will be able to do so. As far as I know, there are two well-known ways of monitoring a cluster and getting its status.
nodetool utility:
The nodetool utility is a command-line interface for monitoring Cassandra and performing routine database operations. Included in the Cassandra distribution, nodetool is typically run directly from an operational Cassandra node.
DataStax OpsCenter: OpsCenter provides a graphical representation of performance trends in a summary view that is hard to obtain with other monitoring tools. The GUI provides views for different time periods as well as the capability to drill down on single data points. Both real-time and historical performance data for a Cassandra or DataStax Enterprise cluster are available in OpsCenter. OpsCenter metrics are captured and stored within Cassandra.
I think the first way (the nodetool utility) will be more useful for meeting your requirements.
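For example, a script can shell out to `nodetool status` (which lists every node in the cluster) and sum the per-node Load column into a single cluster-wide figure; a sketch using a hard-coded sample in the shape nodetool prints (addresses and sizes are made up):

```python
# Sample output in the shape printed by `nodetool status`;
# the addresses, host IDs, and sizes are illustrative.
sample = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load      Tokens  Owns  Host ID  Rack
UN  10.0.0.1  5.74 GiB  256     ?     aaaa     rack1
UN  10.0.0.2  6.26 GiB  256     ?     bbbb     rack1
UN  10.0.0.3  6.00 GiB  256     ?     cccc     rack1
"""

UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

def cluster_load_gib(status_output):
    """Sum each node's Load column into one cluster-wide figure (GiB)."""
    total = 0.0
    for line in status_output.splitlines():
        parts = line.split()
        # data rows start with a two-letter status/state code, e.g. UN
        if parts and parts[0] in ("UN", "DN", "UL", "UJ"):
            total += float(parts[2]) * UNITS[parts[3]]
    return total / 2**30

print(cluster_load_gib(sample))
```

In a real script you would replace `sample` with the output of `subprocess.run(["nodetool", "status"], ...)` on any one node.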
You will get more information at
Cassandra cluster monitoring and nodetool options.
JMX provides information from a single node. To have information about the entire cluster, we collect data from all nodes into Zabbix. Zabbix allows you to create graphs and screens that show JMX values from all nodes in one place; e.g., we can see Read Pending Tasks for all nodes in a single graph.
I think having separate information for each node in one place is a better way to diagnose possible issues than having common aggregated information.
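That said, if you do want cluster-wide numbers from the per-node JMX values, note that counters can simply be summed, while latencies should be averaged weighted by request count, or a lightly loaded node will skew the cluster average. A small sketch with made-up node readings:

```python
def aggregate(nodes):
    """Combine per-node readings into cluster-wide figures.

    Request counts are counters and simply sum; latency is averaged
    weighted by each node's request count.
    """
    total_reads = sum(n["read_count"] for n in nodes)
    read_latency = sum(
        n["read_count"] * n["read_latency_ms"] for n in nodes
    ) / total_reads
    return {"read_count": total_reads, "read_latency_ms": read_latency}

# hypothetical per-node JMX readings
nodes = [
    {"read_count": 9000, "read_latency_ms": 2.0},
    {"read_count": 1000, "read_latency_ms": 12.0},
]
print(aggregate(nodes))
```

A naive unweighted average of the two latencies would report 7 ms, while the weighted cluster figure is 3 ms.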
Regarding metrics, I can recommend the Guide to Cassandra Thread Pools, which provides a description of the different Cassandra metrics and how to monitor them.
I am new to OpsCenter and trying to get a feel for the metric graphs. The graphs seem slow to refresh and I'm trying to determine if this is a configuration issue on my part or simply what to expect.
For example, I have a three node Cassandra test cluster created via CCM. OpsCenter and the node Agents were configured manually.
I have graphs on the dashboard for Read and Write Requests and Latency. I'm running a JMeter test that inserts 100k rows into a Cassandra table (via REST calls to my webapp) over the course of about 5 minutes.
I have both OpsCenter and VisualVM open. When the test kicks off, the VisualVM graphs immediately start showing the change in load (via the Heap and CPU/GC graphs), but the OpsCenter graphs lag behind and are slow to update. I realize I'm comparing different metrics (i.e. heap vs. write requests), but I would expect to see some immediate indication in OpsCenter that a load is being applied.
My environment is as follows:
Cassandra: dsc-cassandra-2.1.2
OpsCenter: opscenter-5.1.0
Agents: datastax-agent-5.1.0
OS: OSX 10.10.1
Currently metrics are collected every 60 seconds, plus there's an (albeit very small) overhead of inserting them into C*, reading them back on the OpsCenter server side, and pushing them to the UI.
The OpsCenter team is working on both improving metrics collection in general and delivering real-time metrics, so stay tuned.
By the way, comparing VisualVM and OpsCenter in terms of latency is not quite fair, since OpsCenter has to do a lot more work to both collect and aggregate those metrics due to its distributed nature (and also because VisualVM is so close to the meta^WJVM ;)