Graphite: aggregation rules not working on Linux

I have added many aggregation rules, such as:
app.email.server1.total-sent.1d.sum (86400) = sum app.email.server1.total-sent.1h.sum
Is there a limit on the number of aggregation rules? Other aggregation rules of the same kind are working.
I also checked with tcpdump, and packets containing the metric app.email.server1.total-sent.1h.sum are arriving.
Can I debug this by checking logs? I tried, but the logs don't mention anything about which metrics are being aggregated.

You want to sum all 1h values into 1d, so on the right-hand side of the rule, use * instead of 1h:
app.email.server1.total-sent.1d.sum (86400) = sum app.email.server1.total-sent.*.sum
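If you have several servers, carbon-aggregator's rule syntax also supports <field> captures, so a single rule can cover all of them. A sketch of what aggregation-rules.conf could look like (the metric names follow the question's naming scheme; note that the * on the input side would also match the aggregated 1d series itself if it is fed back through the aggregator, so a more specific pattern may be safer in that setup):

```
# aggregation-rules.conf for carbon-aggregator
# format: output_template (frequency) = method input_pattern
#
# <server> is a capture field, so one rule covers every server:
app.email.<server>.total-sent.1d.sum (86400) = sum app.email.<server>.total-sent.*.sum
```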

Related

Azure App Service metrics aggregation for requests: why are Sum and Count different?

When looking at the metrics from our app services in Azure, I'm very confused by the Sum and Count aggregations for requests. According to the MS tech doc, they should be the same:
Count: The number of measurements captured during the aggregation interval.
When the metric is always captured with the value of 1, the count aggregation is equal to the sum aggregation. This scenario is common when the metric tracks the count of distinct events and each measurement represents one event. The code emits a metric record every time a new request arrives.
And this MS tech doc as well.
Though not the case in this example, Count is equal to Sum in cases where a metric is always captured with the value of 1. This is common when a metric tracks the occurrence of a transactional event--for example, the number of HTTP failures mentioned in a previous example in this article.
So, let's say that for a specific period there are 10 HTTP requests: the count of requests is 10, and the sum of requests is also 10.
But ours are all different. Below are one web app service's Sum and Count metrics; you can see they are very different. But why?
From the official REST API, we can see that Count and Sum are still different.
If you want a more detailed explanation, you can refer to the following post, or open a support ticket for help.
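The two aggregations only coincide when every measurement carries the value 1. A minimal sketch of the arithmetic (the measurement values are made up for illustration):

```python
# Each list holds the measurement values recorded in one aggregation interval.
# If every value is 1, Count == Sum; otherwise they diverge.
always_one = [1, 1, 1, 1]   # e.g. one record emitted per request
batched    = [3, 1, 6]      # e.g. pre-aggregated request batches

def count_and_sum(measurements):
    """Mimic the Count and Sum aggregations over one interval."""
    return len(measurements), sum(measurements)

print(count_and_sum(always_one))   # (4, 4)  -> identical
print(count_and_sum(batched))      # (3, 10) -> different
```

So differing Sum and Count usually means the underlying measurements are not all emitted with the value 1.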
Related Post:
Azure App Service Metrics - How to interpret Sum vs. Count related to requests?

Prometheus recording rule to keep the max of (rate of counter)

I am facing a dilemma.
For performance reasons, I'm creating recording rules for my Nginx request/second metrics.
Original Query
sum(rate(nginx_http_requests_total[5m]))
Recording Rule
rules:
  - record: job:nginx_http_requests_total:rate:sum:5m
    expr: sum(rate(nginx_http_requests_total[5m])) by (cache_status, host, env, status)
In the original query I can see that my max traffic is 6.6k, but in the recording rule it's 6.2k. That's a 400 TPS difference.
This is the metric for the last week.
Question:
Is there any way to take the max of the original query and save it as a recording rule? As it's TPS, I only care about the max, not the min.
I think a 6% difference in value during a very short burst is pretty OK.
In your query you are computing (and recording) an average TPS over the last 5 minutes. There is no "max" being performed there.
The value will change depending on the exact time of the query evaluation, which is likely why you see a difference between the raw query and the values stored by the recording rule.
Prometheus also extrapolates data when executing functions like rate(). If the last data point is at time t but the query runs at t+30s, Prometheus will try to extrapolate the value to t+30s (this is often noticed when a counter of discrete events shows fractional values).
You may want to use the irate() function if you are really after peak values. At each evaluation it uses the two most recent points to calculate the most current increase, as opposed to the X-minute average that rate() provides.
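If you specifically want to keep the peak, one common pattern is to layer a second recording rule that takes max_over_time() over the first one. A sketch of a rule file (the second rule's name and the 1h window are assumptions, not something from the question):

```yaml
groups:
  - name: nginx_rates
    rules:
      # per-label 5m average rate, evaluated at the group's interval
      - record: job:nginx_http_requests_total:rate:sum:5m
        expr: sum(rate(nginx_http_requests_total[5m])) by (cache_status, host, env, status)
      # highest value the recorded rate reached over the last hour
      - record: job:nginx_http_requests_total:rate:sum:5m:max_1h
        expr: max_over_time(job:nginx_http_requests_total:rate:sum:5m[1h])
```

The second series then preserves short bursts that a single snapshot of the 5m rate might miss.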

Kibana - add a listener

I have ELK installed, and all works fine. I have one index that always receives logs from Logstash.
Sometimes, Logstash stops working (every second month or so), and nothing comes to the index.
I was wondering: is there a way to query the index at some interval, and if it has no new entries, produce some kind of event that I can handle?
For example, query that index every 10 minutes, and if there are no logs, create an event.
I assume you are looking for ELK's internal tools. The Elasticsearch X-Pack plugin provides watches and notifications. But if that's not a requirement, you can write a Node.js server that queries the last 5 minutes or so, and implement exactly the notification you need.
I hope this helps.
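As a sketch of the polling approach (the @timestamp field name and 10-minute window are assumptions; the actual HTTP call to Elasticsearch is left out so the logic stays self-contained):

```python
from datetime import datetime, timedelta, timezone

def build_count_query(minutes: int) -> dict:
    """Elasticsearch _count query body matching documents newer than `minutes` ago."""
    since = datetime.now(timezone.utc) - timedelta(minutes=minutes)
    return {"query": {"range": {"@timestamp": {"gte": since.isoformat()}}}}

def should_alert(hit_count: int) -> bool:
    """Fire an event when the interval produced no log entries at all."""
    return hit_count == 0

# In a real poller you would POST build_count_query(10) to
# http://<es-host>:9200/<index>/_count every 10 minutes and
# inspect the "count" field of the response.
print(should_alert(0))    # True  -> Logstash looks dead
print(should_alert(42))   # False -> logs are flowing
```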

Find documents in MongoDB with non-typical limit

I have a problem but no idea how to resolve it.
I've got PointValues collection in MongoDB.
The PointValue schema has 3 fields:
dataPoint (ref to DataPoint schema)
value (Number)
time (Date)
There is one pointValue for every hour (24 per day).
I have an API method to get PointValues for a specified DataPoint and time range. The problem is I need to limit the result to at most 1000 points. The typical limit(1000) method isn't a good fit, because I need points spread over the whole specified time range, with a time step that depends on the range and the number of point values.
So... for example:
Requesting data for 1 year = 1 * 365 * 24 = 8760 values.
It should return 1000 values, i.e. approximately one value every 8760 / 1000 ≈ 9 hours.
I have no idea which method I should use to filter that data in MongoDB.
Thanks for any help.
Sampling exactly like that on the database would be quite hard to do and likely not very performant. But an option which gives you a similar result would be to use an aggregation pipeline which $group's the $first value by $year, $dayOfYear, and $hour (and $minute and $second if you need smaller intervals). That way you can sample values by time steps, but your choices of step length are limited to what there are date operators for. So "hourly" samples are easy, but "9-hourly" samples get complicated. When this query is performance-critical and frequent, you might want to consider creating additional collections with daily, hourly, minutely, etc. DataPoints so you don't need to perform that aggregation on every request.
But your documents are quite lightweight, since the actual payload lives in a different collection. So you might consider getting all the results in the requested time range and then doing the skipping on the application layer. You might also combine this with the aggregation described above to pre-reduce the dataset: first use an aggregation pipeline to get hourly results into the application, then step through the result set in steps of 9 documents. Whether or not this makes sense depends on how many documents you expect.
Also remember to create an index on the time field.
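A sketch of the hourly $group/$first pipeline described above, written as a pymongo-style pipeline definition (the collection and field names follow the question's schema; the function name and arguments are my own):

```python
from datetime import datetime, timezone

def hourly_sample_pipeline(data_point_id, start, end):
    """Aggregation pipeline: keep the first PointValue of every hour
    in [start, end) for one DataPoint."""
    return [
        {"$match": {"dataPoint": data_point_id,
                    "time": {"$gte": start, "$lt": end}}},
        {"$sort": {"time": 1}},  # so $first picks the earliest doc per hour
        {"$group": {
            "_id": {"y": {"$year": "$time"},
                    "d": {"$dayOfYear": "$time"},
                    "h": {"$hour": "$time"}},
            "time":  {"$first": "$time"},
            "value": {"$first": "$value"},
        }},
        {"$sort": {"time": 1}},
    ]

# usage with pymongo, then thin further on the application layer:
#   hourly = list(db.pointvalues.aggregate(hourly_sample_pipeline(dp_id, start, end)))
#   sampled = hourly[::9]   # roughly one point per 9 hours
```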

Performance drops dramatically as depth increases in graph traversal

I've been working on a configuration management system using ArangoDB. It collects config data for some common software and streams it to a program which generates the relationships among those pieces of software based on some pre-defined rules, then saves the relations into ArangoDB. Once the relations are established, I provide APIs to query the data. One important query generates the topology of these software components. I use graph traversal to generate the topology with the following AQL:
FOR n IN nginx
  FOR v, e, p IN 0..4 OUTBOUND n forward, dispatch, route, INBOUND deployto, referto, monitoron
    FILTER #domain IN p.edges[0].server_name
    RETURN {id: v._id, type: v.ci_type}
which can generate the following topology:
software relation topology
Which looks fine. However, it takes around 10 seconds to finish the query, which is not acceptable because the data volume is not very large. I checked all the collections; the largest, the "forward" edge collection, only has around 28000 documents. So I did some tests:
Changing the depth from 0..4 to 0..2, the query takes only 0.3 seconds.
With 0..3, it takes around 3 seconds.
With 0..4, it takes around 10 seconds.
Since there is a server_name property on the "forward" edges, I added a hash index on server_name[*], but judging from the explain plan, ArangoDB doesn't seem to use it.
Any tips on how I can optimize the query? And why can't the index be used in this case?
Hope someone can help me out. Thanks in advance.
First of all, I have tried your query and I could see that, for some reason, the filter:
filter #domain in p.edges[0].server_name
is not optimized correctly. This seems to be an internal issue with the optimization rule not being good enough; I will take a detailed look into this and try to make sure that it works as expected.
For this reason the traversal cannot yet use a different index for this case, and will not correctly short-circuit to abort the search at level 1.
I am very sorry for the inconvenience, as the way you did it should be the correct one.
As a quick workaround for now, you can split the first part of the query into a separate step.
This is the fast version of my modified query (which will not include the nginx nodes themselves; see the slower version):
FOR n IN nginx
  FOR forwarded, e IN 1 OUTBOUND n forward
    FILTER #domain IN e.server_name
    /* At this point we only have the relevant first-depth vertices */
    FOR v IN 0..3 OUTBOUND forwarded forward, dispatch, route, INBOUND deployto, referto, monitoron
      RETURN {id: v._id, type: v.ci_type}
This is a slightly slower version of my modified query (keeping your output format; I think it will still be faster than the one you are working with):
FOR tmp IN (
  FOR n IN nginx
    FOR forwarded, e IN 1 OUTBOUND n forward
      FILTER #domain IN e.server_name
      /* At this point we only have the relevant first-depth vertices */
      RETURN APPEND([{id: n._id, type: n.ci_type}], (
        FOR v IN 0..3 OUTBOUND forwarded forward, dispatch, route, INBOUND deployto, referto, monitoron
          RETURN {id: v._id, type: v.ci_type}
      ))
)[**]
  RETURN tmp
In addition, I can give some general advice:
(This will work after we have fixed the optimizer.) Usage of the index: ArangoDB uses statistics/estimates of index selectivity (how well an index narrows down the data) to decide which index is better. In your case it may assume that the edge index is better than your hash index. You could try creating a combined hash index on ["_from", "server_name[*]"], which is likely to have a better estimate than the edge index and could then be used.
In the example you have given, I can see that there is a "large" right part starting at the apppkg node. In the query this right part can be reached in two ways:
a) nginx -> tomcat <- apppkg
b) nginx -> varnish -> lvs -> tomcat <- apppkg
This means the query could walk through the subtree starting at apppkg multiple times (once for every path leading there). With a query depth of 4 and only this topology that does not happen, but if there are shorter paths this may also become an issue. If I am not mistaken, you are only interested in the distinct vertices in the graph and the path is not important, right? If so, you can add OPTIONS to the query to make sure that no vertex (and its dependent subtree) is analysed twice. The modified query would look like this:
FOR n IN nginx
  FOR v, e, p IN 0..4 OUTBOUND n forward, dispatch, route, INBOUND deployto, referto, monitoron
    OPTIONS {bfs: true, uniqueVertices: "global"}
    FILTER #domain IN p.edges[0].server_name
    RETURN {id: v._id, type: v.ci_type}
The change I made is adding OPTIONS to the traversal:
bfs: true => we do a breadth-first search instead of a depth-first search; we only need this to make the result deterministic and to make sure that all vertices within a path depth of 4 are reached correctly.
uniqueVertices: "global" => whenever a vertex has been visited in one traversal (so, in your case, for every nginx separately), it is flagged and will not be looked at again.
If you need the list of all distinct edges as well, use uniqueEdges: "global" instead of uniqueVertices: "global", which applies the uniqueness check at the edge level.
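Once the optimizer fix lands, the combined index from the first piece of advice could be created in arangosh roughly like this (a sketch; the collection name follows the question):

```js
// combined hash index on the "forward" edge collection:
// _from narrows each traversal step, server_name[*] covers the array filter
db.forward.ensureIndex({
  type: "hash",
  fields: ["_from", "server_name[*]"]
});
```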
