Azure percent-cpu avg and max are too different

The graph shows CPU max > 96%, but CPU avg < 10%.
How can this be the case? (I mean, shouldn't the average be > 40%, or at least > 30%?)

Not really. I estimated some of the values from the graph, put them in a spreadsheet, and calculated a rolling 5-minute average, as well as the max CPU and the average of the 5-minute averages. Below is what it looks like. When you average over a time window, it smooths out all the peaks and lows.
Max    5 Min Avg
 85
 40
 20
  5
 25    35
 40    26
  5    19
 10    17
 99    35.8

Max    Average
 99    26.56
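The smoothing effect in the table above can be reproduced with a short sketch (the sample values are my estimates from the graph, not real telemetry):

```python
# Rolling 5-sample average over estimated per-minute CPU values.
samples = [85, 40, 20, 5, 25, 40, 5, 10, 99]

window = 5
rolling = [sum(samples[i - window + 1:i + 1]) / window
           for i in range(window - 1, len(samples))]

print(rolling)                                # [35.0, 26.0, 19.0, 17.0, 35.8]
print(max(samples))                           # 99    -> the "Max" the portal shows
print(round(sum(rolling) / len(rolling), 2))  # 26.56 -> average of the averages
```

A single spike to 99 barely moves the averaged series, which is why Max and Avg can sit so far apart.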
If it is continually at high CPU, then your overall average will start growing.
However, that average does look rather low on your graph. You aren't showing the Min CPU either, so it may be short bursts where CPU is high but more often low usage; you should graph that as well.
Are you trying to configure alerts or scaling? Then you should be looking at the average over a small period, e.g. 5 minutes, and if that exceeds a threshold (usually 75-80%), send the alert and/or scale out.

I asked Microsoft Azure support about this. The answer I received was not great and essentially amounts to "Yes, it does that." They suggested only using the average statistic since (as we've noticed) "max" doesn't work. This is due to the way data gets aggregated internally. The Microsoft Product engineering team has a request (ID: 9900425) in their large list to get this fixed, so it may happen someday.
I did not find any documentation on how that aggregation works, nor would Microsoft provide any.
Existing somewhat useful docs:
Data sources: https://learn.microsoft.com/en-us/azure/azure-monitor/agents/data-sources#azure-resources
Metrics and data collection: https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/data-platform-metrics#data-collection

Related

How to configure an Azure alert for the max cpu held over x time period

I currently have an alert that triggers on average CPU over 80% over the last 15 minutes. However, we have instances that stay over 100% for hours or days while others are very low, leaving the average under 80%. I want to find those instances that are higher than 80% for an extended period of time. What I don't want is a spike, and using the max seems to do just that: send out alerts for spikes, not constants. Is there any way I can get such an alert configured?

calculate maximum total of RU/s in 24 hours for CosmosDB

I set 400 RU/s for my collection in CosmosDB. I want to estimate the maximum total RUs in 24 hours. Please let me know how I can calculate it.
Is the calculation right?
400 * 60 * 60 * 24 = 34,560,000 RU in 24 hours
When talking about maximum RU available within a given time window, you're correct (except that would be total RU, not total RU/second - it would remain constant at 400/sec unless you adjusted your RU setting).
But... given that RUs are reserved on a per-second basis: What you consume within a given one-second window is really up to you and your specific operations, and if you don't consume all allocated RUs (400, in this case) in a given one-second window, it's gone (you're paying for reserved capacity). So yes, you're right about absolute maximum, but that might not match your real-world scenario. You could end up consuming far less than the maximum you've allocated (imagine idle-times for your app, when you're just not hitting the database much). Also note that RUs are distributed across partitions, so depending on your partitioning scheme, you could end up not using a portion of your available RUs.
Note that it's entirely possible to use more than your allocated 400 RU in a given one-second window, which then puts you into "RU Debt" (see my answer here for a detailed explanation).
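The arithmetic for the absolute maximum is just the provisioned throughput times the number of seconds in the window:

```python
# Absolute maximum RUs consumable in 24 hours at a fixed 400 RU/s provision.
provisioned_ru_per_sec = 400
seconds_per_day = 60 * 60 * 24  # 86,400

max_ru_per_day = provisioned_ru_per_sec * seconds_per_day
print(max_ru_per_day)  # 34560000
```

Real consumption will usually be lower, since unused RUs in any one-second window are simply lost.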

azure what is Time Aggregation Max or AVG

I am using Azure SQL Database (SQL Server) for my app. It was working perfectly until recently: the MAX DTU usage has been 100%, but the AVG DTU usage is around 50%.
Which value should I monitor to scale the service, MAX or AVG?
I found on the net after lots of searching:
CPU max/min and average are aggregated within that 1 minute. As 1 minute (60 seconds) is the finest granularity, if you choose, for example, Max, and the CPU has touched 100% even for 1 second, it will show 100% for that entire minute. Perhaps the best is to use the Average; in that case the average CPU utilization over those 60 seconds is shown under that 1-minute metric.
That sort of helped me understand what it all meant, but thanks to bradbury9 too for your input.
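The one-minute aggregation described in that quote can be illustrated with a sketch (the per-second samples are hypothetical):

```python
# One brief spike dominates the per-minute Max but barely moves the Avg.
per_second_cpu = [5.0] * 59 + [100.0]  # 59 quiet seconds + one spike second

minute_max = max(per_second_cpu)
minute_avg = sum(per_second_cpu) / len(per_second_cpu)

print(minute_max)            # 100.0
print(round(minute_avg, 2))  # 6.58
```

So a single busy second makes Max read 100% for the whole minute while Average stays in single digits.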

Fractional quantities in stripe

I would like to use stripe to bill per GB of data used in an application. Here is an example of an invoice:
Usage: 2.38GB
Rate: $0.20/GB
Total: $0.46
However, it seems like Stripe only allows integer quantity, so how would the above be done? If I were to bill per MB, then I'd have the following:
Usage: 2380MB
Rate: $0.0002
Total: $0.46
However, the lowest rate that I could add (at least from what it looked like in the dashboard) was $0.01.
So, what would be the best way to accomplish the above (other than round to the nearest GB -- which looks a bit fishy to an end user, imho).
Stripe cannot support charging less than the minimum unit for a currency (because you can't charge a customer less than that, for example if they only used 1 unit).
You could set your "Unit" to be the smallest amount of data that would total one cent, so that you are rounding to the smallest amount possible. In this case, $0.20/GB = $0.0002/MB = $0.01/50MB, i.e. 1 cent per 50 MB. You would have to account for this when reporting usage to the API by keeping track of usage yourself and updating the API using action=set rather than increment [0].
While this is what you'd have to do behind the scenes, there's no reason you have to expose that to the user. You could still list your rate in units of MB, with a note saying that totals will be rounded to the nearest 50MB.
[0] https://stripe.com/docs/api/usage_records/create#usage_record_create-action
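The 50 MB unit trick can be sketched like this (billable_units is a hypothetical helper; you would report its result via Stripe's usage-records endpoint with action=set):

```python
# Convert measured usage into billable 50 MB units ($0.20/GB = $0.01 per 50 MB).
MB_PER_UNIT = 50

def billable_units(usage_mb: float) -> int:
    """Round measured usage to the nearest 50 MB unit."""
    return round(usage_mb / MB_PER_UNIT)

units = billable_units(2380)  # the 2.38 GB example from the question
print(units)                  # 48 units -> 48 * $0.01 = $0.48
```

Rounding to the nearest unit means the invoiced total can differ by up to half a cent from the exact per-MB price, which is as close as Stripe's integer quantities allow.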

Performance testing - Jmeter results

I am using Jmeter (started using it a few days ago) as a tool to simulate a load of 30 threads using a csv data file that contains login credentials for 3 system users.
The objective I set out to achieve was to measure 30 users (threads) logging in and navigating to a page via the menu over a time span of 30 seconds.
I have set my thread group as:
Number of threads: 30
Ramp-up Period: 30
Loop Count: 10
I ran the test successfully. Now I'd like to understand what the results mean and what is classed as good/bad measurements, and what can be suggested to improve the results. Below is a table of the results collated in the Summary report of Jmeter.
I have conducted research only to find blogs/sites telling me the same info as what is defined on the jmeter.apache.org site. One blog (Nicolas Vahlas) that I came across gave me some very useful information, but it still hasn't helped me understand what to do next with my results.
Can anyone help me understand these results and what I could do next following the execution of this test plan? Or point me in the right direction of an informative blog/site that will help me understand what to do next.
Many thanks.
In my opinion, the deviation is high.
You know your application better than all of us.
You should focus on whether the average response time you got, and the maximum response time's frequency and value, are acceptable to you and your users. This applies to throughput also.
Your results show the average response time is below 0.5 seconds and the maximum response time is below 1 second, which is generally acceptable, but that should be defined by you (is it acceptable to your users?). If the answer is yes, try with more load to check scaling.
Your requirement mentions 30 concurrent users performing different actions. The response time of your requests is low and you have a ramp-up of 30 seconds, so please check the total active threads during the test. I believe the time for which there are actually 30 concurrent users in the system is pretty short, so the average response time you are seeing may be misleading. I would suggest running the test for longer, so that there really are 30 concurrent users in the system; that would give a correct reading per your requirements.
You can use the Aggregate report instead of the Summary report. In performance testing, the following are typically used for analysis:
Throughput - requests/second
Response Time - 90th percentile
Target application resource utilization (CPU, processor queue length, and memory)
Normally the SLA for websites is 3 seconds, but this requirement changes from application to application.
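For the 90th-percentile response time mentioned above, here is one quick way to compute it from raw sample times (the values are hypothetical):

```python
# 90th-percentile response time from a list of sample times in milliseconds.
import statistics

response_times_ms = [120, 180, 210, 250, 300, 340, 400, 450, 520, 900]

p90 = statistics.quantiles(response_times_ms, n=10)[-1]  # last cut point = P90
print(p90)  # 862.0
```

Note how the single 900 ms outlier pulls P90 far above the average; that is exactly why percentile-based SLAs are preferred over averages.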
Your test results are good, considering if the users are actually logging into system/portal.
Samples: the number of requests sent to a particular module.
Average: average response time, across the 300 samples.
Min: minimum response time among the 300 samples (fastest of the 300).
Max: maximum response time among the 300 samples (slowest of the 300).
Standard Deviation: a measure of the variation (across the 300 samples).
Error: failure percentage.
Throughput: number of requests processed per second.
Hope this will help.
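The Summary Report fields described above can be recomputed from raw data; this sketch uses made-up sample times:

```python
# Recompute Summary Report style statistics from a batch of response times (ms).
import statistics

samples_ms = [210, 230, 190, 800, 220, 240, 205, 215]  # hypothetical samples
errors = 1                                             # hypothetical failed requests
duration_s = 4.0                                       # hypothetical test duration

print("Average:", statistics.mean(samples_ms))            # 288.75
print("Min:", min(samples_ms))                            # 190
print("Max:", max(samples_ms))                            # 800
# As far as I can tell, JMeter uses the population form (divide by n, not n-1):
print("Std Dev:", round(statistics.pstdev(samples_ms), 2))
print("Error %:", 100 * errors / len(samples_ms))         # 12.5
print("Throughput:", len(samples_ms) / duration_s, "req/s")  # 2.0 req/s
```

The one slow 800 ms sample inflates both Max and the standard deviation while the average stays modest, mirroring the pattern discussed in the answers above.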
