Cassandra metrics: difference between latency and total latency

I'm using Cassandra 2.2 and sending the Cassandra metrics to Graphite using pluggable metrics.
I've looked at org.apache.cassandra.metrics.ColumnFamily and saw that there is a "count" attribute on both ReadLatency and ReadTotalLatency.
What is the difference between the two "count" attributes?
My main goal is to get the latency per read/write. How do you advise me to get it?
Thanks!

org.apache.cassandra.metrics.ColumnFamily.ReadTotalLatency is a Counter that gives the sum of all read latencies.
org.apache.cassandra.metrics.ColumnFamily.ReadLatency is a Timer that gives insight into how long reads are taking; it reports attributes like min, max, mean, 75percentile, 90percentile, and 99percentile.
For your purpose you should be using ReadLatency and WriteLatency.

Difference between the two "count" attributes
org.apache.cassandra.metrics.ColumnFamily.ReadTotalLatency is a Counter.
Its "count" attribute provides the sum of all read latencies.
org.apache.cassandra.metrics.ColumnFamily.ReadLatency is a Timer.
Its "count" attribute provides the count of Timer#update calls.
To get the recent latency per read/write
Use attributes like "min", "max", "mean", "75percentile", "90percentile", "99percentile".
Cassandra 2.2.7 uses DecayingEstimatedHistogramReservoir for Timer's reservoir, which makes recent values more significant.
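If you want to sanity-check these numbers outside Graphite, here is a minimal sketch (my own illustration, not part of the original answer) that reads both MBeans over JMX with plain javax.management. The keyspace "ks", table "tbl", host, and port 7199 are placeholders; the attribute names are the ones exposed by the Dropwizard Metrics JMX reporter that Cassandra 2.2 ships with, and latencies are assumed to be in microseconds:
import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}
object ReadLatencyProbe {
  def main(args: Array[String]): Unit = {
    // Connect to the node's JMX port (7199 by default).
    val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi")
    val connector = JMXConnectorFactory.connect(url)
    val mbsc = connector.getMBeanServerConnection
    // Table-level metric beans; "ks" and "tbl" stand in for your keyspace and table.
    val timer = new ObjectName(
      "org.apache.cassandra.metrics:type=ColumnFamily,keyspace=ks,scope=tbl,name=ReadLatency")
    val total = new ObjectName(
      "org.apache.cassandra.metrics:type=ColumnFamily,keyspace=ks,scope=tbl,name=ReadTotalLatency")
    // Timer attributes: the recent per-read latency distribution.
    val mean = mbsc.getAttribute(timer, "Mean")
    val p99 = mbsc.getAttribute(timer, "99thPercentile")
    // Timer "Count" = number of reads timed; Counter "Count" = sum of all read latencies.
    val reads = mbsc.getAttribute(timer, "Count").asInstanceOf[java.lang.Long].longValue
    val latencySum = mbsc.getAttribute(total, "Count").asInstanceOf[java.lang.Long].longValue
    val lifetimeAvg = if (reads > 0) latencySum / reads else 0L
    println(s"mean=$mean us, p99=$p99 us, lifetime avg=$lifetimeAvg us over $reads reads")
    connector.close()
  }
}
This also makes the relationship between the two "count" attributes concrete: dividing ReadTotalLatency's count by ReadLatency's count gives the average latency per read since startup, while the Timer's mean and percentiles reflect recent behaviour.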

Related

How to get the aggregated CPU usage in Grafana when hyperthreading is enabled

We are running Grafana/Prometheus to monitor our CPU metrics and want the aggregated CPU usage of all CPUs. The problem is that we have hyperthreading enabled, and when we stress the CPU the percentage exceeds 100%. My question is how to limit the reported CPU usage to at most 100%, even when the CPU is highly utilized.
P.S. I have tried setting the max and min limits in Grafana, but the graph spikes still go above that limit.
Kindly give me the right query for this problem.
The queries I have tried are given below.
sum(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
100 - avg(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
We have also tried other similar queries.
If all you want is to "cap" a variable or expression result at a maximum value (that is, 100), you can simply use the Prometheus function clamp_max.
Thus, you could do:
clamp_max(<expr>, 100)
This is probably the most helpful query:
(1 - avg(irate(node_cpu_seconds_total{instance="$instance",job="$job",mode!="idle"}[5m])))*100
Replace $instance with your instance IP and $job with your node exporter job name.

Hazelcast management center shows get latency of 0 ms for replicated map

Setup:
3-member embedded cluster deployed as a Spring Boot jar.
Total keys on each member: 900K.
The get operation is being attempted via a REST API.
Background:
I am trying to benchmark Hazelcast's replicated map.
The Management Center UI shows around 10k/s get requests being executed, but the average get latency is shown as 0 ms.
I believe it is not showing because it might be in microseconds.
Please let me know how to configure the Management Center UI to show latency in micro/nanoseconds.
The Management Center UI shows around 10k/s get requests being executed, but the average get latency is shown as 0 ms.
I believe you're talking about Replicated Map Throughput Statistics in the replicated map details page. The Avg Get Latency column in that table shows on average how much time it took for a cluster member to execute the get operations for the time period that is selected on the top right corner of the table. For example, if you select Last Minute there, you only see the average time it took for the get operations in the last minute.
I believe it is not showing because it might be in microseconds.
The cluster sends it as milliseconds (a newer cluster version calculates it in nanoseconds but still sends it as milliseconds). However, since a replicated map replicates all data on all members and every member contains the whole data set, get latency is typically very low as there is no network trip.
I guess that the way we render very small metric values confused you. In the Management Center UI we only show two fractional digits, so since the value is very low it is rendered as 0. I believe we could do a better job rendering these values (using a smaller time unit, for example); I will create an issue for this on our private issue tracker.
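If you want to see the sub-millisecond values yourself, you can also read the member-side statistics directly. Below is a minimal sketch (my own, not something Management Center provides) against a single standalone member, with the hypothetical map name "benchmark-map"; note that the unit of getTotalGetLatency depends on the Hazelcast version (milliseconds on older releases, nanoseconds on newer ones):
import com.hazelcast.core.Hazelcast
object ReplicatedMapGetLatency {
  def main(args: Array[String]): Unit = {
    val hz = Hazelcast.newHazelcastInstance()
    val map = hz.getReplicatedMap[String, String]("benchmark-map")
    map.put("k", "v")
    // Issue some gets so there is something to measure.
    (1 to 1000).foreach(_ => map.get("k"))
    // LocalReplicatedMapStats for this member's share of the operations.
    val stats = map.getReplicatedMapStats
    val gets = stats.getGetOperationCount
    val totalLatency = stats.getTotalGetLatency // ms or ns depending on version
    if (gets > 0)
      println(s"average get latency = ${totalLatency.toDouble / gets} (raw unit) over $gets gets")
    hz.shutdown()
  }
}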

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream and they need to be computed for windows of varying durations. For example, I might need to compute the avg value of a stat 'A' for the last 5 mins while at the same time compute the median for stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to a different value as required, e.g.:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:
How would streamA and streamB be represented in the underlying data structure?
Would they share data, since they originate from the kafkaDStream? Or would there be duplication of data?
Also, are there more efficient methods to handle such a use case?
Thanks in advance
Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
One thing you haven't made clear, though, is whether you really need the update component of your aggregation that is implied by the windowing operation: your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour, updated every 10 minutes.
If you don't need that freshness, not requiring it will of course minimize the amount of data in the system. You can have a streamA with a batch interval of 5 minutes and a streamB which is derived from it (with window(Hours(1)), since 60 is a multiple of 5), as sketched below.
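As a rough sketch of that layout (my own illustration; a socket source stands in for the Kafka direct stream, and the actual average/median computations are left out):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
object MultiWindowStats {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("multi-window-stats").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Minutes(5)) // batch interval = 5 minutes
    // Stand-in for the Kafka DStream (e.g. from KafkaUtils.createDirectStream).
    val kafkaDStream = ssc.socketTextStream("localhost", 9999)
    val streamA = kafkaDStream                     // stat 'A': each 5-minute batch as-is
    val streamB = kafkaDStream.window(Minutes(60)) // stat 'B': last hour, recomputed every 5-minute batch
    // ... compute the average on streamA and the median on streamB here ...
    streamA.count().print()
    streamB.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}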

Oracle: Is there a way to check which sql_ids were downgraded to serial or a lesser degree over a period of time?

I would like to know if there is a way to check the sql_ids that were downgraded to either serial or a lesser degree in an Oracle 4-node RAC data warehouse, version 11.2.0.3. I want to write a script to check which queries are downgraded.
SELECT NAME, inst_id, VALUE FROM GV$SYSSTAT
WHERE UPPER (NAME) LIKE '%PARALLEL OPERATIONS%'
OR UPPER (NAME) LIKE '%PARALLELIZED%' OR UPPER (NAME) LIKE '%PX%'
NAME                                          VALUE
queries parallelized                          56083
DML statements parallelized                       6
DDL statements parallelized                     160
DFO trees parallelized                        56249
Parallel operations not downgraded            56128
Parallel operations downgraded to serial        951
Parallel operations downgraded 75 to 99 pct       0
Parallel operations downgraded 50 to 75 pct       0
Parallel operations downgraded 25 to 50 pct     119
Parallel operations downgraded 1 to 25 pct        2
Does it ever refresh? What conclusion can be drawn from above output? Is it for a day? month? hour? since startup?
This information is stored as part of Real-Time SQL Monitoring. But it requires licensing the Diagnostics and Tuning packs, and it only stores data for a short period of time.
Oracle 12c can supposedly store SQL Monitoring data for longer periods of time. If you don't have Oracle 12c, or if you don't have those options licensed, you'll need to create your own monitoring tool.
Real-Time SQL Monitoring of Parallel Downgrades
select /*+ parallel(1000) */ * from dba_objects;
select sql_id, sql_text, px_servers_requested, px_servers_allocated
from v$sql_monitor
where px_servers_requested <> px_servers_allocated;
SQL_ID         SQL_TEXT                 PX_SERVERS_REQUESTED  PX_SERVERS_ALLOCATED
6gtf8np006p9g  select /*+ parallel ...                  3000                    64
Creating a (Simple) Historical Monitoring Tool
Simplicity is the key here. Real-Time SQL Monitoring is deceptively simple and you could easily spend weeks trying to recreate even a tiny portion of it. Keep in mind that you only need to sample a very small amount of all activity to get enough information to troubleshoot. For example, just store the results of GV$SESSION or GV$SQL_MONITOR (if you have the license) every minute. If the query doesn't show up from sampling every minute then it's not a performance issue and can be ignored.
For example, create a table:
create table downgrade_check(sql_id varchar2(100), total number);
and create a job with DBMS_SCHEDULER to run:
insert into downgrade_check select sql_id, count(*) total from gv$session where sql_id is not null group by sql_id;
Note that the count from GV$SESSION will rarely be exactly the same as the DOP.
Other Questions
V$SYSSTAT is updated pretty frequently (every few seconds?), and represents the total number of events since the instance started.
It's difficult to draw many conclusions from those numbers. From my experience, having only 2% of your statements downgraded is a good sign. You likely have good (usually default) settings and not too many parallel jobs running at once.
However, some parallel queries run for seconds and some run for weeks. If the wrong job is downgraded, even a single downgrade can be disastrous. Storing some historical session information (or using DBA_HIST_ACTIVE_SESSION_HISTORY) may help you find out if your critical jobs were affected.

Logstash metrics plugin: What does events.rate_5m mean?

This should be a fairly easy question for Logstash veterans.
When I use the metrics plugin, what does events.rate_5m mean?
Does it mean: Number of events per second in a 5 minute window?
Does it mean: Number of events every 5 minutes?
Also, what's the difference between using this and timer.rate_5m?
The documentation isn't very clear and I have problems understanding it.
Thanks in advance!
Logstash uses the Metriks library to generate the metrics.
According to that site:
A meter that measures the mean throughput and the one-, five-, and fifteen-minute exponentially-weighted moving average throughputs.
and
A timer measures the average time as well as throughput metrics via a meter.
A meter counts events and a timer is used to look at durations (you have to pass a name and a value into a timer).
To answer your specific question, rate_5m is the per-second event rate, computed as an exponentially-weighted moving average over the last 5 minutes.
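If it helps to see that behaviour concretely, here is a small sketch using Dropwizard Metrics' Meter on the JVM, which implements the same meter idea as the Metriks library Logstash uses (an analogue for illustration, not Logstash's own code):
import com.codahale.metrics.MetricRegistry
object MeterDemo {
  def main(args: Array[String]): Unit = {
    val registry = new MetricRegistry
    val meter = registry.meter("events")
    // Each mark() records one event, like one Logstash event passing through the filter.
    (1 to 1000).foreach(_ => meter.mark())
    // getFiveMinuteRate() is events *per second*, exponentially weighted towards the
    // last five minutes -- the equivalent of events.rate_5m. Note the EWMA only ticks
    // every five seconds, so immediately after marking it may still read 0.0.
    println(s"count = ${meter.getCount}")
    println(s"rate_5m = ${meter.getFiveMinuteRate} events/sec")
  }
}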
