PromQL: increase over counter

Here is my current counter value:
method_timed_seconds_count{method="getByUId"} ---> 68
After making an HTTP request to my service, this counter is incremented:
method_timed_seconds_count{method="getByUId"} ---> 69
I want to see by how much this counter has increased inside a 30s window, using this:
increase(method_timed_seconds_count{method="getByUId"}[30s]) ---> 2
However, I'm getting value 2!
Why? I was expecting to get 1!
Scrape interval is 15s.
Any ideas?

Prometheus has the following issues with increase() calculations:
It extrapolates increase() results - see this issue.
It doesn't take into account the increase between the last raw sample before the specified lookbehind window in square brackets and the first raw sample inside the lookbehind window. See this design doc for details.
It misses the increase for the first raw sample in a time series. For example, if a time series starts from 5 and has the following samples: 5 6 9 12, then increase over these samples would return something around 12-5=7 instead of the expected 12.
That's why it isn't recommended to use increase() in Prometheus for calculating exact counter increases.
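In the question above, extrapolation alone can explain the observed value of 2. Assuming the 30s lookbehind window happened to contain exactly two raw samples scraped 15s apart (68 and 69), the calculation is roughly:
raw increase between the two samples = 69 - 68 = 1 (covering the ~15s between them)
extrapolated to the full 30s window  = 1 * (30 / 15) = 2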
P.S. If you need to calculate exact counter increases over a given lookbehind window, then try VictoriaMetrics - a Prometheus-like monitoring system I work on. Its increase() function is free from the issues mentioned above.

Related

How to aggregate data by period in a rrdtool graph

I have an rrd file with average ping times to a server (GAUGE) every minute, and when the server is offline (which happens very frequently, for reasons that don't matter now) it stores a NaN/unknown.
I'd like to create a graph with the percentage the server is offline each hour which I think can be achieved by counting every NaN within 60 samples and then dividing by 60.
For now I get to the point where I define a variable that is 1 when the server is offline and 0 otherwise, but I already read the docs and don't know how to aggregate this:
DEF:avg=server.rrd:rtt:AVERAGE CDEF:offline=avg,UN,1,0,IF
Is it possible to do this when creating a graph? Or I will have to store that info in another rrd?
I don't think you can do exactly what you want, but you have a couple of options.
You can define a sliding-window average that shows the percentage of the previous hour that was unknown, and graph that using TRENDNAN.
DEF:avg=server.rrd:rtt:AVERAGE:step=60
CDEF:offline=avg,UN,100,0,IF
CDEF:pcavail=offline,3600,TREND
LINE:pcavail#ff0000:Offline
This defines avg as the 1-minute time series of ping data. Note we use step=60 to ensure we get the best resolution of data even in a smaller graph. Then we define offline as 100 when the ping average is unknown (the server is offline) and 0 when it is known. Then pcavail is a 1-hour sliding-window average of this, which will in effect be the percentage of the previous hour during which the server was offline.
However, there's a problem in that RRDTool will silently summarise the source data before you get your hands on it if there are many data points per pixel in the graph (this won't happen when doing a fetch, of course). To get around that, you'd need to have the offline calculation done at store time, i.e. have a COMPUTE type DS that is 100 or 0 depending on whether the rtt DS is unknown. Then any averaging will preserve the data (normal averaging omits the unknowns, or the xff setting makes the whole CDP unknown).
rrdtool create ...
DS:rtt:GAUGE:120:0:9999
DS:offline:COMPUTE:rtt,UN,100,0,IF
rrdtool graph ...
DEF:offline=server.rrd:offline:AVERAGE:step=3600
LINE:offline#ff0000:Offline
If you are able to modify your RRD, and do not need historical data, then use of a COMPUTE in this way will allow you to display your data in a 1-hour stepped graph as you wanted.
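For reference, a complete pair of commands along those lines might look like the sketch below. The DS definitions are taken from the answer above; the 60-second step and the RRA retentions are assumptions added only to make the example self-contained, so adjust them to your setup:
rrdtool create server.rrd --step 60 \
    DS:rtt:GAUGE:120:0:9999 \
    DS:offline:COMPUTE:rtt,UN,100,0,IF \
    RRA:AVERAGE:0.5:1:10080 \
    RRA:AVERAGE:0.5:60:8784

rrdtool graph offline.png --start -86400 --end now \
    DEF:offline=server.rrd:offline:AVERAGE:step=3600 \
    LINE:offline#ff0000:Offline
The first RRA keeps a week of 1-minute points and the second roughly a year of hourly averages; because the COMPUTE DS is evaluated at update time, its 100/0 values are consolidated like any other DS.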

Prometheus recording rule to keep the max of (rate of counter)

I am facing a dilemma.
For performance reasons, I'm creating recording rules for my Nginx request/second metrics.
Original Query
sum(rate(nginx_http_requests_total[5m]))
Recording Rule
rules:
  - expr: sum(rate(nginx_http_requests_total[5m])) by (cache_status, host, env, status)
    record: job:nginx_http_requests_total:rate:sum:5m
In the original query I can see that my max traffic is 6.6k, but in the recording rule it's 6.2k. That's a 400 TPS difference.
This is the metric for the last week:
Question:
Is there any way to take the max of the original query and save it as a recording rule? As it's TPS, I only care about the max, not the min.
I think having a 6% difference in value on some very short burst is pretty OK.
In your query you are getting (and recording) an average TPS over the last 5 minutes. There is no "max" being performed there.
The value will change depending on the exact time of the query evaluation, which is possibly why you see a difference between the raw query and the values stored by the recording rule.
Prometheus will also extrapolate data somewhat when executing functions like rate(). If you have the last data point at time t but run the query at t+30s, then Prometheus will try to extrapolate the value at t+30s (this is often noticed when a counter of discrete events shows fractional values).
You may want to use the irate() function if you are really after peak values. At each evaluation it uses the two most recent points to calculate the most current increase, as opposed to the X-minute average that rate() provides.
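If you do want to record a peak rather than the 5-minute average itself, one possible sketch (not something the answer above proposes) is to wrap the expression in max_over_time() over a subquery. This needs Prometheus 2.7+ with subquery support, and the 1h window and 1m resolution below are assumptions:
rules:
  - record: job:nginx_http_requests_total:rate5m:max1h
    expr: max_over_time(sum by (cache_status, host, env, status) (rate(nginx_http_requests_total[5m]))[1h:1m])
Note this still smooths bursts shorter than the 5-minute rate window; it only keeps the largest 5-minute average seen over the trailing hour at each evaluation.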

Weather Underground API call limit per minute

I have to limit my API requests to 10 calls per minute. How can I modify the for loops to accomplish this?
I am trying to add time.sleep(8) in the observation for loop, without any luck... Any ideas?
import arrow # learn more: https://python.org/pypi/arrow
from WunderWeather import weather # learn more: https://python.org/pypi/WunderWeather
import time

api_key = ''
extractor = weather.Extract(api_key)
zip = '53711'

# get 20170101 00:00
begin_date = arrow.get("2017","YYYY")
# get 20171231 23:00
end_date = arrow.get("2018","YYYY").shift(hours=-1)

for date in arrow.Arrow.range('hour',begin_date,end_date):
    # get date object for feature
    # http://wunderweather.readthedocs.io/en/latest/WunderWeather.html#WunderWeather.weather.Extract.date
    date_weather = extractor.date(zip,date.format('YYYYMMDD'))

    # use shortcut to get observations and data
    # http://wunderweather.readthedocs.io/en/latest/WunderWeather.html#WunderWeather.date.Observation
    for observation in date_weather.observations:
        time.sleep(8)
        print("Date:",observation.date_pretty)
        print("Temp:",observation.temp_f)
A possible explanation of why you are still exceeding the API limit might have to do with the line on which you are adding the time wait. If the API response you get contains no observations, the inner loop won't execute at all. So first I would try to move the time wait into the outer loop, right after the API call, as sketched below.
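For illustration, a minimal rearrangement of the loop from the question, assuming a 6-second pause is enough to stay at or below 10 calls per minute:
for date in arrow.Arrow.range('hour', begin_date, end_date):
    # one API call per date; sleeping here throttles every call,
    # even when the response contains no observations
    date_weather = extractor.date(zip, date.format('YYYYMMDD'))
    time.sleep(6)  # 10 calls/minute -> at most one call every 6 seconds
    for observation in date_weather.observations:
        print("Date:", observation.date_pretty)
        print("Temp:", observation.temp_f)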
You might also consider using something like LoopingCall from Twisted to schedule your task to run every X seconds:
http://twistedmatrix.com/documents/9.0.0/core/howto/time.html
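A rough sketch of that approach, assuming you wrap a single API call in a function of your own (fetch_next below is a hypothetical helper, not part of WunderWeather):
from twisted.internet import task, reactor

def fetch_next():
    # pop the next date off a queue and make one extractor.date() call here
    pass

loop = task.LoopingCall(fetch_next)
loop.start(6.0)  # fire every 6 seconds, i.e. at most 10 calls per minute
reactor.run()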
Depending on how real-time you want your data to be, if you can afford to be a day behind you could get all observations for a date in the past, which would be a single API call to retrieve a whole day of data (or an end-of-day summary of the current day's observations).
Alternatively, if you're trying to get the current weather every X minutes or so (staying under the limit), I'd use some sort of loop with a timer (or possibly Twisted, which seems to abstract the "loop") and make a call to one of the following endpoints, depending on what you're looking for. Your current code is looking up dates in the past, but these other endpoints are for the current day.
Either way, you don't want the timer in the observations loop since, as mentioned above, there might be no observations.
http://wunderweather.readthedocs.io/en/latest/WunderWeather.html#WunderWeather.weather.Extract.hourly_daycast
http://wunderweather.readthedocs.io/en/latest/WunderWeather.html#WunderWeather.weather.Extract.today_now
which can be called similarly to the examples here:
http://wunderweather.readthedocs.io/en/latest/index.html#additional-examples

Calculating the throughput (requests/sec) and plotting it

I'm using the JMeter client to test the throughput of a certain workload (PHP+MySQL, 1 page) on a certain server. Basically I'm doing a "capacity test" with an increasing number of threads over time.
I installed the "Statistical Aggregate Report" JMeter plugin and this was the result (ignore the "Response time" line):
At the same time I used the "Simple Data Writer" listener to write a log file ("JMeter.csv"). Then I tried to "manually" calculate the throughput for every second of the test.
Each line of "JMeter.csv" has this format:
timestamp      elapsedtime  responsecode  success  bytes
1385731020607  42           200           true     325
...            ...          ...           ...      ...
The timestamp refers to the time when the request was made by the client, not when the request was served by the server. So I simply did: totaltime = timestamp + elapsedtime.
In the next step I converted the totaltime to a date format, like: 13:17:01.
I have more than 14K samples and with Excel I was able to do this quickly.
Then I counted how many samples there were for each second. Example:
totaltime samples (requestsServed/second)
13:17:01 204
13:17:02 297
... ...
When I tried to plot the results I obtained the following graphic:
As you can see, it is very different from the first graph.
Given that the first graph is correct, what is the mistake in my formula/procedure for calculating the throughput?
It turns out that this plugin is plotting something I can't identify... I tried many times and my calculations were actually correct. Be careful with this plugin (or check its source code).
Throughput can be viewed in the JMeter Summary Report, and you can also calculate it yourself by saving your test results to an XML file and applying:
Throughput = (Number of samples / (Max(ts + t) - Min(ts))) * 1000
That is, the number of samples divided by the difference between the end time of the latest sample (its timestamp plus its elapsed time) and the start time of the earliest sample, scaled from milliseconds to seconds.
With this formula you can calculate the throughput for each HTTP request in the Summary Report.
Example:
Latest sample end time (ts + t) = 1485538701633 + 569 = 1485538702202
Earliest sample start time (ts) = 1485538143112
Throughput = (2 / (1485538702202 - 1485538143112)) * 1000
Throughput = (2 / 559090) * 1000
Throughput ≈ 0.0036/sec
You can read more, with examples, here: http://www.wikishown.com/how-to-calculate-throughput-in-jmeter/ - that's where I got a good idea about throughput calculation.
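As an illustration only, here is the same formula applied to a Simple Data Writer log like the one in the question; the whitespace-separated columns and their order are assumptions based on the sample line shown above:
import csv

def overall_throughput(path):
    # Throughput = samples / (max(ts + t) - min(ts)), scaled from ms to req/sec
    starts, ends = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=" "):
            if not row or not row[0].isdigit():
                continue  # skip a header line or blanks
            ts, elapsed = int(row[0]), int(row[1])  # start timestamp (ms), elapsed time (ms)
            starts.append(ts)
            ends.append(ts + elapsed)
    return len(starts) / (max(ends) - min(starts)) * 1000

print(overall_throughput("JMeter.csv"))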

Tracking metrics using StatsD (via etsy) and Graphite, graphite graph doesn't seem to be graphing all the data

We have a metric that we increment every time a user performs a certain action on our website, but the graphs don't seem to be accurate.
So going off this hunch, we inspected carbon's updates.log and discovered that the action had happened over 4 thousand times today (using grep and wc), but according to the integral of the graph it returned only around 220.
What could be the cause of this? Data is being reported to statsd using the statsd PHP library, by calling statsd::increment('metric'); and as stated above, the log confirms that 4,000+ updates to this key happened today.
We are using:
graphite 0.9.6 with statsD (etsy)
After some research through the documentation, and some conversations with others, I've found the problem - and the solution.
The way the whisper file format is designed, it expects you (or your application) to publish updates no faster than the minimum interval in your storage-schemas.conf file. This file is used to configure how much data retention you have at different time-interval resolutions.
My storage-schemas.conf file was set with a minimum retention time of 1 minute. The default StatsD daemon (from etsy) is designed to flush to carbon (the graphite daemon) every 10 seconds. The reason this is a problem: over a 60 second period StatsD reports 6 times, and each write overwrites the previous one (within that 60 second interval, because you're updating faster than once per minute). This produces really weird results on your graph, because the last 10 seconds in a minute could be completely dead and report a 0 for the activity during that period, which results in completely nuking all of the data you had written for that minute.
To fix this, I had to re-configure my storage-schemas.conf file to store data at a maximum resolution of 10 seconds, so every update from StatsD would be saved in the whisper database without being overwritten.
Etsy published the storage-schemas.conf configuration that they were using for their installation of carbon, which looks like this:
[stats]
priority = 110
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974
This has a 10 second minimum retention time, and stores 6 hours worth of them. However, due to my next problem, I extended the retention periods significantly.
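For reference, those retentions work out to roughly 6 hours of 10-second points (10 × 2160 = 21,600s), one week of 1-minute points (60 × 10,080 = 604,800s), and about 5 years of 10-minute points (600 × 262,974 ≈ 158 million seconds).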
As I let this data collect for a few days, I noticed that it still looked off (and was under reporting). This was due to 2 problems.
StatsD (older versions) only reported an average number of events per second for each 10 second reporting period. This means, if you incremented a key 100 times in 1 second and 0 times for the next 9 seconds, at the end of the 10th second statsD would report 10 to graphite, instead of 100 (100/10 = 10). This failed to report the total number of events for a 10 second period (obviously).
Newer versions of statsD fix this problem, as they introduced the stats_counts bucket, which logs the total # of events per metric for each 10 second period (so instead of reporting 10 in the previous example, it reports 100).
After I upgraded StatsD, I noticed that the last 6 hours of data looked great, but as I looked beyond the last 6 hours things looked weird, and the next reason is why:
As graphite stores data, it moves data from high precision retention to lower precision retention. This means, using the etsy storage-schemas.conf example, that after 6 hours of 10 second precision, data was moved to 60 second (1 minute) precision. In order to move 6 data points from 10s to 60s precision, graphite takes an average of the 6 data points: it takes the total value of the oldest 6 data points and divides it by 6. This gives an average number of events per 10 seconds for that 60 second period (and not the total number of events, which is what we care about specifically).
This is just how graphite is designed, and for some cases it might be useful, but in our case it's not what we wanted. To "fix" this problem, I increased our 10 second precision retention time to 60 days. Beyond 60 days, I store the minutely and 10-minutely precisions, but they're essentially there for no reason, as that data isn't as useful to us.
I hope this helps someone, I know it annoyed me for a few days - and I know there isn't a huge community of people that are using this stack of software for this purpose, so it took a bit of research to really figure out what was going on and how to get a result that I wanted.
After posting my comment above I found Graphite 0.9.9 has a (new?) configuration file, storage-aggregation.conf, in which one can control the aggregation method per pattern. The available options are average, sum, min, max, and last.
http://readthedocs.org/docs/graphite/en/latest/config-carbon.html#storage-aggregation-conf
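For example, an entry that sums the stats_counts metrics (instead of averaging them) when they are rolled up to lower precision could look like the following sketch; the pattern is an assumption based on the bucket name mentioned above:
[stats_counts]
pattern = ^stats_counts\..*
xFilesFactor = 0
aggregationMethod = sum
With sum aggregation, the roll-up keeps the total number of events per interval, which is what we wanted here.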

Resources