What resource metrics (memory, CPU, etc.) should I be looking at for auto-scaling purposes?

What cloud resource metrics (memory, CPU, disk I/O, etc.) should I be looking at for auto-scaling purposes? FYI, the metrics are strictly used for auto scaling. I have a Kubernetes architecture and Prometheus (for monitoring and scraping metrics).
I have a Kubernetes cluster set up locally as well as in the cloud. I am using Prometheus (https://prometheus.io/) to scrape system-level metrics. Now I want to add an auto-scaling feature to my system. So far I have been saving metrics such as "memory and CPU used, allocated, and total for the last 24 hours," and I want to save more. This is the list of metrics that I am getting from Prometheus: http://demo.robustperception.io:9100/metrics I can't decide which additional metrics I am going to need for auto-scaling. Can anyone suggest some metrics for this purpose? TIA.

Normally, the common bottleneck is the memory hierarchy rather than CPU usage. The more requests your application receives, the more likely it is to hit an out-of-memory error. What is more, unless your application is HPC, it is not likely to be CPU-intensive.
In the memory hierarchy, disk I/O can dramatically affect performance. You would need to check how disk-I/O-intensive your application is. In that case, upgrading the disk hardware could be a better solution than spinning up more instances. However, that depends on the application.
In any case, it would be useful to measure the average response time as well, and then make scaling decisions accordingly.
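Whichever metric you pick, the Kubernetes Horizontal Pod Autoscaler turns it into a scaling decision with one formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A minimal Python sketch of that rule (the sample utilization figures are illustrative):

```python
import math

def desired_replicas(current_replicas, current_value, target_value):
    """Kubernetes HPA scaling rule:
    desired = ceil(current_replicas * current_value / target_value)."""
    return math.ceil(current_replicas * current_value / target_value)

# 4 pods averaging 80% CPU against a 50% target -> scale out.
print(desired_replicas(4, 80, 50))   # 7
# 4 pods averaging 20% CPU against a 50% target -> scale in.
print(desired_replicas(4, 20, 50))   # 2
```

The same formula works for any averaged metric (memory, requests per second, a custom Prometheus metric), which is why choosing a metric that tracks your real bottleneck matters more than collecting many of them.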

Related

What to expect if Azure App Service CPU maxes out momentarily?

I currently have an Azure App Service API that usually runs extremely low average and max CPU (<10% utilization). Every now and again, the CPU will spike due to a temporary spike in client requests. These spikes seemingly only last for a split second, but I’m wondering if this is cause for concern. What is the result of an Azure App Service CPU maxing out temporarily (either for a split second or for several seconds)? Will this cause the app to crash, or will it just buffer requests until intensive tasks complete? It is worth noting that despite the spike in CPU, memory utilization remains low. Thanks in advance for input.
It looks like the CPU load is caused entirely by a large number of intensive requests all coming in at the same time.
When the CPU utilization of your Azure App Service API spikes temporarily, it could cause the app to slow down or become unresponsive, depending on the level and duration of the spike.
The system will try to buffer requests during this time, but if the spike is too high, some requests may be dropped or time out. This can result in a poor user experience or errors, especially if the spike is sustained for an extended period of time.
To mitigate this issue, you can take a few steps.
You can optimize the code of your API to reduce the CPU utilization of each request. This could involve reducing the number of operations performed for each request, using caching or other performance-enhancing techniques, or optimizing the algorithm used for processing requests.
You can also consider scaling up the resources of your App Service, such as increasing the number of CPU cores or adding more memory, to handle the increased load during spikes. Another option is to use horizontal scaling by adding more instances of your API to distribute the load across multiple servers, which can help reduce the impact of spikes.
And you can monitor your App Service to detect and respond to spikes in CPU utilization in real-time.
For example, you can use Azure Monitor to set up alerts that trigger when CPU utilization exceeds a certain threshold, and automatically scale your App Service in response.
By checking Azure Monitor logs and Web App Diagnostics, you can find the reasons behind high CPU utilization.
(Screenshots: Diagnose and solve problems; CPU Usage; CPU Drill Down.)
References:
Application monitoring for Azure App Service
CPU Diagnostics, Identify and Diagnose High CPU issues
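The key point when reacting to momentary spikes is to require the threshold to be breached over a sustained window, which is what Azure Monitor alert rules do with their aggregation period. A minimal sketch of that idea in Python (the threshold and window length here are illustrative assumptions, not Azure defaults):

```python
def sustained_breach(samples, threshold=80.0, min_consecutive=3):
    """Return True only if `threshold` is exceeded for `min_consecutive`
    samples in a row -- a split-second spike should not trigger scaling."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False

# A momentary 100% spike does not fire the alert...
print(sustained_breach([10, 100, 9, 12, 8]))    # False
# ...but several consecutive high samples do.
print(sustained_breach([10, 95, 97, 99, 12]))   # True
```

Tuning the window this way keeps autoscale from flapping on the split-second spikes the question describes while still catching sustained load.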

Choosing the right EC2 instance for three NodeJS Applications

I'm running three MEAN stack applications. Each application receives over 10,000 monthly users. Could you please assist me in finding an EC2 instance for my apps?
I've been using a "t3.large" instance with two vCPUs and eight gigabytes of RAM, but it costs $62 to $64 per month.
I need help deciding which EC2 instance to use for three Nodejs applications.
First, check the CloudWatch metrics for the current instances. Are CPU and memory usage consistent over time? Analysing the metrics could help you decide whether to select a smaller or bigger instance.
One way to avoid unnecessary costs is to use Auto Scaling groups and load balancers. By finding and applying the proper settings, you can always have the right amount of computing power for your applications.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html
https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html
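As a sketch of that analysis, the datapoints below stand in for hourly CPUUtilization averages you would pull from CloudWatch; the thresholds are illustrative assumptions, not AWS recommendations:

```python
def rightsizing_hint(cpu_datapoints, low=20.0, high=70.0):
    """Suggest an instance-size direction from average CPU utilization.
    `low` and `high` are illustrative thresholds, not AWS guidance."""
    avg = sum(cpu_datapoints) / len(cpu_datapoints)
    if avg < low:
        return "downsize"
    if avg > high:
        return "upsize or scale out"
    return "keep current size"

# Hourly averages from a mostly idle t3.large suggest a smaller instance:
print(rightsizing_hint([8, 12, 10, 15, 9]))   # downsize
```

Running the same check on memory (via the CloudWatch agent, since memory is not a default EC2 metric) gives the other half of the picture before committing to a size.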
It depends on your applications: do they need more compute power, more memory, or more storage? Choosing a server is similar to installing an app on a system: check its basic requirements, then choose the server accordingly.
If you have 10k+ monthly customers, think about using an ALB so that traffic gets distributed evenly. Try caching to serve some content if possible. Use the unlimited burst mode of t3 instances if CPU keeps hitting 100%. Also, try to optimize your code so that fewer resources are consumed. Once you are comfortable with your EC2 choice, purchase Savings Plans or Reserved Instances to lower the cost.
Also, monitor the servers and traffic using features such as the CloudWatch agent and Internet Monitor.

What would cause high KUDU usage (and eventual 502 errors) on an Azure App Service Plan?

We have a number of API apps and WebApps on an Azure App Service P2v2 instance. We've been experiencing an amount of platform instability: the App Service becomes unhealthy and we get a rash of 502 errors across various of the Apps (different ones each time), attributable to very high CPU and Memory usage on the app service. We've tried scaling all the way up to P3v2, but whatever the issue is seems eventually to consume all resources available.
Whenever we've been able to trace a culprit among the apps, it has turned out not to be the app itself but the Kudu service related to it.
A sample error message is: "High physical memory usage detected on multiple occasions. The kudu process for the app [sitename]'pe-services-color' is the most common cause of high memory usage. The most common cause of high memory usage for the kudu process is web jobs." The actual app whose Kudu service is named changes quite frequently.
What could be causing the Kudu services to consume so much CPU/Memory, and what can we do to stabilise this app service?
Is it simply that we have too many apps running on one plan? This seems unlikely since all these apps ran previously on a single classic cloud service instance, but if so, what are the limits for apps and slots on a single plan?
(I have seen this question but the answer doesn't help)
Update
From Azure support, these are apparently the limits on Small - Medium - Large non-shared app services:
Worker Size   Max sites
Small         5
Medium        10
Large         20
with 'sites' comprising app services/api apps and their slots.
They seem ridiculously low, and make the larger App Service units highly uneconomic. Can anyone confirm these numbers?
(Incidentally, we found that turning off Always On across the board fixed the issue - it was only causing a problem on empty sites though - we haven't had a chance yet to see if performance is good with all the sites filled.)
High CPU and memory utilization is mostly caused by your program/code itself. If there are a lot of CPU-intensive tasks, and you apply a lot of parallel programming that spawns many new threads, that can contribute to high CPU and memory utilization. So review your code for such instances. When the number of parallel threads increases, CPU utilization goes high and the service starts scaling up frequently, which adds to your cost; sometimes it also causes thread loss and unexpected results. As Azure resource costs are high, you need to plan your performance accordingly.
You can monitor this using the Metrics option in the App Service plan blade.

CPU utilization in performance testing

I am doing performance testing on an app. I found that when the number of virtual users increases, the response time increases linearly (that should be natural, right?), but CPU utilization stops increasing when it reaches around 60%. Does that mean the CPU is the bottleneck? If not, what could the bottleneck be?
The bottleneck might or might not be CPU, you need to consider monitoring other OS metrics as well, to wit:
Physical RAM
Swap usage
Network IO
Disk IO
Each of them could be the bottleneck.
Also, when you increase the number of users, an ideal system should increase the number of TPS (transactions per second) by the same factor. When you increase virtual users and TPS does not increase, the situation is called the saturation point, and you need to find out what is slowing your system down.
If resource utilization is far from 95-100% and your system still produces large response times, the reason can be non-optimal application code or a slow database query; in that case you will need to use profiling tools to get to the bottom of the issue.
See the How to Monitor Your Server Health & Performance During a JMeter Load Test article for more information on monitoring the application under test.
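The saturation point can be found mechanically by comparing each load step's throughput against ideal linear scaling. A small sketch in Python, assuming you have (virtual users, TPS) pairs from successive test steps (the sample numbers and 10% tolerance are illustrative):

```python
def saturation_point(load_steps, tolerance=0.10):
    """Given (virtual_users, tps) pairs from successive load-test steps,
    return the user count where TPS stops growing proportionally."""
    for (u1, t1), (u2, t2) in zip(load_steps, load_steps[1:]):
        expected = t1 * (u2 / u1)            # ideal linear scaling
        if t2 < expected * (1 - tolerance):  # throughput fell short here
            return u2
    return None                              # no saturation observed

steps = [(10, 100), (20, 200), (40, 390), (80, 410)]
print(saturation_point(steps))   # 80
```

In the example, doubling users from 40 to 80 yields almost no extra TPS, so 80 virtual users is where profiling should start, even though CPU may still read only ~60%.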

bluemix runtime: auto vertical scalability?

It is clear that the auto-scale service allows an application to be scaled in and out horizontally automatically, and that I can manually scale my application vertically by increasing/decreasing its memory.
Is there a way to AUTOMATICALLY increase and decrease the memory associated to the node.js instances based on some rules?
Note that Bluemix charges the application based on GB * hours, so by and large you will be charged similarly for vertical and horizontal scaling. However, vertical scaling does improve memory usage efficiency because there is less memory overhead (e.g., you load the Node.js runtime only once rather than twice or more). But horizontal scaling also has its merits:
Better availability due to increased app instances;
Better concurrency due to distributed processes;
Potentially better exploitation of CPU resources (because of the way cgroups work for CPU allocation).
So if your application is memory-hungry, allocating a large amount of memory for each instance would make sense. Otherwise, if the app is CPU-hungry, horizontal scaling may work better. You can run some benchmarks to evaluate the response-time and throughput impact of both options.
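Since the charge is proportional to GB × hours, the arithmetic below shows why one large instance and several small ones cost the same for the same total memory (a sketch; actual Bluemix rates are not modeled):

```python
def gb_hours(instances, gb_per_instance, hours):
    """GB*hour billing: the billable quantity is proportional to
    total memory allocated times the hours it is allocated."""
    return instances * gb_per_instance * hours

# One 2 GB instance vs two 1 GB instances over 24 h: same billable quantity.
vertical = gb_hours(1, 2.0, 24)
horizontal = gb_hours(2, 1.0, 24)
print(vertical, horizontal)   # 48.0 48.0
```

The cost being equal, the choice between the two shapes comes down to the availability, concurrency, and memory-overhead trade-offs listed above.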
The Auto-Scaling add-on in Bluemix monitors the chosen resources against their policies and increases or decreases the number of instances; it does not perform vertical scaling (memory).
Why does your Node app's memory requirement grow? Can you offload some of it by using a database or caching service? Relying on increasing memory when needed is currently a bad practice because it requires a small downtime as your application restarts.
As Ram mentioned. The Auto-Scaling service doesn't currently support vertical scaling.
You can scale horizontally by discrete numbers of instances or by a % of the total number of instances.
See the docs for which metrics are supported by each application type
