Azure VM stats - Network In/Out - what are the measurements?

This is probably simple, but I don't understand the measurement Azure uses for Network In/Out and a few other things.
On the Azure portal -> my VM -> Metrics -> [Host] Network In/Out, the metric is labelled as bytes, but it is also drawn as a graph over time. If it were plain bytes, it should be cumulative and therefore grow indefinitely, but it doesn't, so I am inclined to believe it is measured per second or something like that. But the Azure docs claim that it is bytes and not bytes per second (link here).
Am I missing something obvious?

I am inclined to believe the data is in bytes per minute. At least for mine it appears that way. I set my graph to a 10 minute interval. Taking the mouse off the graph, the total bytes show at the bottom. Hovering over the individual sample points (10 in total), they average between 31-34MB each. Adding them up, you get close to the total for the graph interval, 326MB. 10*32.5 is very close to this total, leading me to believe that each point on the graph is a sum over its interval (1 minute). That is what I am seeing, anyway. Terrible documentation from Microsoft. Why not just specify this in the (i) hover point on the individual graph?
@eddyP23 - if you add up all the points in your graph, it appears you would come to the same conclusion. Each point is a sum over its interval (1 minute). I am not sure how else to read this.
If it were bytes per second, the data total for the complete 10 minute interval would be vastly larger.
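A quick sketch of that arithmetic (the point values are hypothetical, chosen to match the 31-34MB observations above):

# Ten 1-minute sample points, hypothetical values in MB:
points_mb = [31, 33, 32, 34, 33, 31, 32, 34, 33, 33]

total_mb = sum(points_mb)                  # 326 MB -- matches the graph's interval total
avg_per_point = total_mb / len(points_mb)  # 32.6 MB per 1-minute point
print(total_mb, avg_per_point)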

Sorry for the delay.
therefore I am inclined to believe it is measured per second or something like that. But Azure docs claim that it is bytes and not bytes per second
You can find the Network In here:
The Network In (bytes per second) is used to monitor your VM's performance.


How to aggregate data by period in a rrdtool graph

I have an rrd file with average ping times to a server (GAUGE) every minute, and when the server is offline (which is very frequent, for reasons that don't matter now) it stores a NaN/unknown.
I'd like to create a graph with the percentage of time the server is offline each hour, which I think can be achieved by counting every NaN within 60 samples and then dividing by 60.
For now I have got to the point where I define a variable that is 1 when the server is offline and 0 otherwise, but I have already read the docs and don't know how to aggregate this:
DEF:avg=server.rrd:rtt:AVERAGE CDEF:offline=avg,UN,1,0,IF
Is it possible to do this when creating a graph? Or will I have to store that info in another rrd?
I don't think you can do exactly what you want, but you have a couple of options.
You can define a sliding window average that shows the percentage of the previous hour that was unknown, and graph that, using TREND.
DEF:avg=server.rrd:rtt:AVERAGE:step=60
CDEF:offline=avg,UN,100,0,IF
CDEF:pcoffline=offline,3600,TREND
LINE:pcoffline#ff0000:Offline
This defines avg as the 1-minute time series of ping data. Note we use step=60 to ensure we get the best resolution of data even in a smaller graph. Then we define offline as 100 when the data is unknown (the server is offline) and 0 when it is known. Then, pcoffline is a 1-hour sliding window average of this, which is in effect the percentage of time during the previous hour during which the server was offline.
However, there's a problem in that RRDTool will silently summarise the source data before you get your hands on it if there are many data points per pixel in the graph (this won't happen when doing a fetch, of course). To get around that, you'd need to have the offline CDEF done at store time -- i.e., have a COMPUTE type DS that is 100 or 0 depending on whether the rtt DS is unknown. Then, any averaging will preserve the data (normal averaging omits the unknowns, or the xff setting makes the whole CDP unknown).
rrdtool create ...
DS:rtt:GAUGE:120:0:9999
DS:offline:COMPUTE:rtt,UN,100,0,IF
rrdtool graph ...
DEF:offline=server.rrd:offline:AVERAGE:step=3600
LINE:offline#ff0000:Availability
If you are able to modify your RRD, and do not need historical data, then use of a COMPUTE in this way will allow you to display your data in a 1-hour stepped graph as you wanted.
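A small Python sketch of why the COMPUTE approach preserves the percentage while averaging the raw series does not (the values are illustrative; Python's NaN stands in for rrdtool's unknown):

import math

# One hour of 1-minute pings; NaN means the server was offline.
rtt = [20.0, 22.0, float("nan")] * 20          # 60 samples, 1/3 offline

# COMPUTE-style series stored alongside: 100 when unknown, 0 when known.
offline = [100 if math.isnan(v) else 0 for v in rtt]

# Averaging the offline series keeps the information we want:
print(sum(offline) / len(offline))             # 33.3... -> percent offline

# Averaging rtt itself just skips the unknowns (as rrdtool consolidation does),
# so the offline fraction disappears from the result:
known = [v for v in rtt if not math.isnan(v)]
print(sum(known) / len(known))                 # ~21.0, no trace of downtime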

Hazelcast management center shows get latency of 0 ms for replicated map

Setup:
3-member embedded cluster deployed as a Spring Boot jar.
Total keys on each member: 900K.
Get operations are being attempted via a REST API.
Background:
I am trying to benchmark Hazelcast's replicated map.
The Management Center UI shows around 10k requests/s being executed, but the average get latency is showing as 0 ms.
I believe it is showing 0 because the actual value might be in microseconds.
Please let me know how to configure the Management Center UI to show latency in micro/nanoseconds.
The Management Center UI shows around 10k requests/s being executed, but the average get latency is showing as 0 ms.
I believe you're talking about Replicated Map Throughput Statistics in the replicated map details page. The Avg Get Latency column in that table shows on average how much time it took for a cluster member to execute the get operations for the time period that is selected on the top right corner of the table. For example, if you select Last Minute there, you only see the average time it took for the get operations in the last minute.
I believe it is showing 0 because the actual value might be in microseconds.
The cluster sends it as milliseconds (newer cluster versions calculate it in nanoseconds, but still send milliseconds). However, since a replicated map replicates all data on all members and every member contains the whole data set, get latency is typically very low, as there is no network round trip.
I guess the way we render very small metric values confused you. In the Management Center UI, we only show two fractional digits, so a very low value such as 0.004 ms is rendered as 0.00 and appears to be zero. I believe we can do a better job rendering these values (using a smaller time unit, for example). I will create an issue for this on our private issue tracker.
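A minimal sketch of the rendering issue (the two-digit format is an assumption for illustration, not Management Center's actual code):

# Illustrative only -- assumes a fixed two-decimal display format,
# not Management Center's actual rendering code.
avg_get_latency_ms = 0.004                    # i.e. 4 microseconds
print(f"{avg_get_latency_ms:.2f} ms")         # prints "0.00 ms"
print(f"{avg_get_latency_ms * 1000:.0f} us")  # prints "4 us" -- a friendlier unit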

Sampling Intervals for Azure Metrics

Could you please help me understand how Azure metrics are calculated? For example, we can see the requests/total requests graph, and a number at a given point in time (say 6000 at 4.34 PM) for Azure API Management. The request count has no meaning at a single point in time; it is a measure over a given period of time. When I researched, I found that the number represents the number of requests received during the sampling interval. However, no further documentation is available on what the sampling interval is, and the Azure portal/metrics graphs have no setting to view or change the sampling interval either.
So what is the sampling interval used for Azure metrics? Or what does the total request count mean at a given point in time?
(However, the Application Insights metrics allow you to set the sampling interval.)
Could you please shed some light? Thanks.
I think the comment from AjayKumar-MSFT is correct, so I am summarizing it in case it helps others:
Typically, 'Requests' is the total number of requests regardless of their resulting HTTP status code, whereas 'Requests In Application Queue' is the number of requests in the application request queue. You can always change the chart settings for more detailed info by going into the metric and choosing ellipsis (...) / settings.
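If you want an explicit interval rather than whatever the portal chart picks, you can query the same metric with a chosen granularity. A minimal sketch using the azure-monitor-query Python package (the resource ID is a placeholder, and "Requests" as the metric name is an assumption for API Management):

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID -- substitute your API Management instance.
resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.ApiManagement/service/<apim-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())
result = client.query_resource(
    resource_id,
    metric_names=["Requests"],          # assumed metric name
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),   # explicit sampling interval
    aggregations=[MetricAggregationType.TOTAL],
)

for metric in result.metrics:
    for point in metric.timeseries[0].data:
        # Each point is the total for one 1-minute interval.
        print(point.timestamp, point.total)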

DynamoDB Downscales within the last 24 hours?

Is it possible to get a count of how often DynamoDB throughput (write units/read units) was downscaled within the last 24 hours?
My idea is to downscale as soon as a huge drop, e.g. 50%, in the needed provisioned write units occurs. I have really peaky traffic, so it is interesting to me to downscale after every peak. However, I have an analytics job running at night which provisions a huge number of read units, making it necessary to be able to downscale after it. Thus I need to limit downscales to 3 times within 24 hours.
The number of decreases is returned in a DescribeTable result as part of the ProvisionedThroughputDescription (see its NumberOfDecreasesToday field).
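A minimal boto3 sketch (the table name is a placeholder):

import boto3

dynamodb = boto3.client("dynamodb")

# DescribeTable returns the table's provisioned throughput description,
# which includes how many times throughput was decreased today.
table = dynamodb.describe_table(TableName="my-table")["Table"]
print(table["ProvisionedThroughput"]["NumberOfDecreasesToday"])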

Tracking metrics using StatsD (from Etsy) and Graphite: the Graphite graph doesn't seem to be graphing all the data

We have a metric that we increment every time a user performs a certain action on our website, but the graphs don't seem to be accurate.
So, going off this hunch, we inspected carbon's updates.log and discovered that the action had happened over 4 thousand times today (using grep and wc), but according to the Integral result of the graph it returned only 220ish.
What could be the cause of this? Data is being reported to StatsD using the StatsD PHP library, calling statsd::increment('metric'); and, as stated above, the log confirms that 4,000+ updates to this key happened today.
We are using:
Graphite 0.9.6 with StatsD (Etsy)
After some research through the documentation, and some conversations with others, I've found the problem - and the solution.
The way the whisper file format is designed, it expects you (or your application) to publish updates no more often than the minimum interval in your storage-schemas.conf file. This file configures how much data retention you have at different time interval resolutions.
My storage-schemas.conf file was set with a minimum retention time of 1 minute. The default StatsD daemon (from Etsy) is designed to update carbon (the Graphite daemon) every 10 seconds. The reason this is a problem: over a 60 second period StatsD reports 6 times, and each write overwrites the previous one (within that 60 second interval, because you're updating more often than once per minute). This produces really weird results on your graph, because the last 10 seconds in a minute could be completely dead and report a 0 for the activity during that period, which nukes all of the data you had written for that minute.
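A minimal sketch of that overwrite behaviour (a plain dict stands in for the whisper archive; this is not whisper's actual storage code):

# Simulate a whisper archive with 60-second resolution as a dict keyed by
# the start of each minute. Each write to a slot replaces, not adds.
archive = {}

def write(timestamp, value, step=60):
    slot = timestamp - (timestamp % step)  # align to the start of the minute
    archive[slot] = value                  # last write wins

# StatsD flushing every 10 seconds within the same minute:
for ts, value in zip(range(0, 60, 10), [12, 7, 5, 25, 3, 0]):
    write(ts, value)

print(archive)  # {0: 0} -- only the final (dead) flush survives the minute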
To fix this, I had to reconfigure my storage-schemas.conf file to store data at a highest precision of 10 seconds, so every update from StatsD would be saved in the whisper database without being overwritten.
Etsy published the storage-schemas.conf configuration that they were using for their installation of carbon, which looks like this:
[stats]
priority = 110
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974
This has a 10 second minimum retention time and stores 6 hours' worth of points. However, due to my next problem, I extended the retention periods significantly.
As I let this data collect for a few days, I noticed that it still looked off (and was underreporting). This was due to 2 problems.
StatsD (older versions) only reported an average number of events per second for each 10 second reporting period. This means, if you incremented a key 100 times in 1 second and 0 times for the next 9 seconds, at the end of the 10th second StatsD would report 10 to Graphite, instead of 100 (100/10 = 10). This failed to report the total number of events for a 10 second period (obviously). Newer versions of StatsD fix this problem, as they introduced the stats_counts bucket, which logs the total # of events per metric for each 10 second period (so instead of reporting 10 in the previous example, it reports 100). After I upgraded StatsD, I noticed that the last 6 hours of data looked great, but as I looked beyond the last 6 hours, things looked weird, and the next reason is why:
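A minimal sketch of the difference (plain Python standing in for StatsD's flush logic, not its actual code):

FLUSH_INTERVAL_S = 10
increments = 100  # 100 increments in the first second, 0 for the next 9

# Old behaviour: counters were normalised to a per-second rate.
rate_per_s = increments / FLUSH_INTERVAL_S  # 10.0 -- what old StatsD sent to Graphite
# Newer behaviour: the stats_counts bucket keeps the raw total per flush.
stats_counts = increments                   # 100  -- the actual event count

print(rate_per_s, stats_counts)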
As Graphite stores data, it moves data from high precision retention to lower precision retention. This means, using the Etsy storage-schemas.conf example, after 6 hours of 10 second precision, data is moved to 60 second (1 minute) precision. In order to merge 6 data points from 10s into 60s precision, Graphite averages the 6 data points: it takes the total value of the oldest 6 data points and divides it by 6. This gives an average # of events per 10 seconds for that 60 second period (and not the total # of events, which is what we care about specifically). This is just how Graphite is designed, and for some cases it might be useful, but in our case, it's not what we wanted. To "fix" this problem, I increased our 10 second precision retention time to 60 days. Beyond 60 days, I store the minutely and 10-minutely precisions, but they're essentially there for no reason, as that data isn't as useful to us.
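A quick sketch of that rollup arithmetic (illustrative numbers):

# Six 10-second stats_counts points covering one minute:
points_10s = [100, 0, 0, 40, 0, 20]             # 160 events in the minute

rollup_avg = sum(points_10s) / len(points_10s)  # ~26.7 -- graphite's default average
rollup_sum = sum(points_10s)                    # 160   -- the total we actually want

print(rollup_avg, rollup_sum)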
I hope this helps someone, I know it annoyed me for a few days - and I know there isn't a huge community of people that are using this stack of software for this purpose, so it took a bit of research to really figure out what was going on and how to get a result that I wanted.
After posting my comment above I found Graphite 0.9.9 has a (new?) configuration file, storage-aggregation.conf, in which one can control the aggregation method per pattern. The available options are average, sum, min, max, and last.
http://readthedocs.org/docs/graphite/en/latest/config-carbon.html#storage-aggregation-conf
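For example, an entry that makes the rollup sum the stats_counts buckets instead of averaging them (the section name and pattern are my assumptions based on the bucket name; xFilesFactor = 0 keeps a consolidated point even when some source intervals are empty):

[stats_counts]
pattern = ^stats_counts\.
xFilesFactor = 0
aggregationMethod = sum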
