Azure App Service Autoscale Fails to Scale In

My App Service fails to scale in after scaling out. This is a pattern I've been trying to troubleshoot for several months.
I've tried the following but none have worked:
My scale condition is based on CPU and memory. However, I've never seen CPU go past 12%, so I'm assuming it's actually scaling based on memory.
Set the scale-out condition to memory over 90% on a 5-minute average with a 10-minute cooldown, and the scale-in condition to memory under 70% on a 5-minute average. In hindsight this doesn't make sense: if my memory utilization is already at 90%, I most likely have an underlying memory leak and should have scaled out long before.
Set the scale-out condition to memory over 80% on a 60-minute average with a 10-minute cooldown, and the scale-in condition to memory under 60% on a 5-minute average. This makes more sense, as I've seen memory usage burst over a few hours and then drop.
Expected behavior: autoscale reduces the instance count once memory usage stays below 60% for 5 minutes.
Question:
What is the ideal threshold on a metric to scale smoothly by, given that my baseline CPU sits at roughly 6% on average and memory at 53%? In other words, what are the best minimum values to scale in on and maximum values to scale out on, without running into anti-patterns such as flapping? A larger gap of 20% between the thresholds makes more sense to me.
Alternative solution:
Given the amount of troubleshooting involved in what's marketed as "push-button scaling", the vagueness of the configuration makes it almost not worth the headache (you can't even use IIS metrics like connection count without a custom PowerShell script!). I'm considering disabling autoscaling because of its unpredictability, keeping 2 instances running for automatic load balancing, and scaling manually.
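If I go that route, my understanding is I wouldn't even need to delete the autoscale setting: a profile with no rules and minimum = maximum = 2 should pin the count (a sketch in the same schema as my configuration below, untested):
"profiles": [
  {
    "name": "Pinned at two instances",
    "capacity": {
      "minimum": "2",
      "maximum": "2",
      "default": "2"
    },
    "rules": []
  }
]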
Autoscale Configuration:
{
  "location": "East US 2",
  "tags": {
    "$type": "Microsoft.WindowsAzure.Management.Common.Storage.CasePreservedDictionary, Microsoft.WindowsAzure.Management.Common.Storage"
  },
  "properties": {
    "name": "CPU and Memory Autoscale",
    "enabled": true,
    "targetResourceUri": "/redacted",
    "profiles": [
      {
        "name": "Auto created scale condition",
        "capacity": {
          "minimum": "1",
          "maximum": "10",
          "default": "1"
        },
        "rules": [
          {
            "scaleAction": {
              "direction": "Increase",
              "type": "ChangeCount",
              "value": "1",
              "cooldown": "PT10M"
            },
            "metricTrigger": {
              "metricName": "MemoryPercentage",
              "metricNamespace": "",
              "metricResourceUri": "/redacted",
              "operator": "GreaterThanOrEqual",
              "statistic": "Average",
              "threshold": 80,
              "timeAggregation": "Average",
              "timeGrain": "PT1M",
              "timeWindow": "PT1H"
            }
          },
          {
            "scaleAction": {
              "direction": "Decrease",
              "type": "ChangeCount",
              "value": "1",
              "cooldown": "PT5M"
            },
            "metricTrigger": {
              "metricName": "MemoryPercentage",
              "metricNamespace": "",
              "metricResourceUri": "/redacted",
              "operator": "LessThanOrEqual",
              "statistic": "Average",
              "threshold": 60,
              "timeAggregation": "Average",
              "timeGrain": "PT1M",
              "timeWindow": "PT10M"
            }
          },
          {
            "scaleAction": {
              "direction": "Increase",
              "type": "ChangeCount",
              "value": "1",
              "cooldown": "PT5M"
            },
            "metricTrigger": {
              "metricName": "CpuPercentage",
              "metricNamespace": "",
              "metricResourceUri": "/redacted",
              "operator": "GreaterThanOrEqual",
              "statistic": "Average",
              "threshold": 60,
              "timeAggregation": "Average",
              "timeGrain": "PT1M",
              "timeWindow": "PT1H"
            }
          },
          {
            "scaleAction": {
              "direction": "Decrease",
              "type": "ChangeCount",
              "value": "1",
              "cooldown": "PT5M"
            },
            "metricTrigger": {
              "metricName": "CpuPercentage",
              "metricNamespace": "",
              "metricResourceUri": "/redacted",
              "operator": "LessThanOrEqual",
              "statistic": "Average",
              "threshold": 40,
              "timeAggregation": "Average",
              "timeGrain": "PT1M",
              "timeWindow": "PT10M"
            }
          }
        ]
      }
    ],
    "notifications": [
      {
        "operation": "Scale",
        "email": {
          "sendToSubscriptionAdministrator": false,
          "sendToSubscriptionCoAdministrators": false,
          "customEmails": [
            "redacted"
          ]
        },
        "webhooks": []
      }
    ],
    "targetResourceLocation": "East US 2"
  },
  "id": "/redacted",
  "name": "CPU and Memory Autoscale",
  "type": "Microsoft.Insights/autoscaleSettings"
}

For the CpuPercentage metric you have a scale-out action when it goes above 60 and a scale-in action when it goes below 40, and the difference between the two is very small. This can cause a behavior described as flapping, which prevents autoscale's scale-in action from kicking in. The MemoryPercentage rule you have configured has the same issue.
You should have a difference of at least 40 between your scale-out and scale-in thresholds to avoid flapping. More details on flapping are at https://learn.microsoft.com/en-us/azure/monitoring-and-diagnostics/insights-autoscale-best-practices#choose-the-thresholds-carefully-for-all-metric-types (search for the word "flapping").
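For example (illustrative thresholds only; tune them to your own baseline), a memory rule pair honoring that gap would look like this in your configuration:
{
  "scaleAction": {
    "direction": "Increase",
    "type": "ChangeCount",
    "value": "1",
    "cooldown": "PT10M"
  },
  "metricTrigger": {
    "metricName": "MemoryPercentage",
    "metricResourceUri": "/redacted",
    "operator": "GreaterThanOrEqual",
    "statistic": "Average",
    "threshold": 85,
    "timeAggregation": "Average",
    "timeGrain": "PT1M",
    "timeWindow": "PT10M"
  }
},
{
  "scaleAction": {
    "direction": "Decrease",
    "type": "ChangeCount",
    "value": "1",
    "cooldown": "PT10M"
  },
  "metricTrigger": {
    "metricName": "MemoryPercentage",
    "metricResourceUri": "/redacted",
    "operator": "LessThanOrEqual",
    "statistic": "Average",
    "threshold": 45,
    "timeAggregation": "Average",
    "timeGrain": "PT1M",
    "timeWindow": "PT10M"
  }
}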

I have exactly the same problem, and I've come to believe that autoscaling back down to one instance, like we both want, is currently not possible.
My current workaround is to scale in to 1 instance with a second profile that repeats every day between 23:55 and 00:00.
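The forcing profile looks roughly like this (a sketch; the time zone is a placeholder, and the portal closes the 23:55-00:00 window by generating a matching end profile):
{
  "name": "Force scale-in to 1",
  "capacity": {
    "minimum": "1",
    "maximum": "1",
    "default": "1"
  },
  "rules": [],
  "recurrence": {
    "frequency": "Week",
    "schedule": {
      "timeZone": "UTC",
      "days": [ "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday" ],
      "hours": [ 23 ],
      "minutes": [ 55 ]
    }
  }
}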
Just to reiterate the problem, here is my scenario; it is basically identical to yours.
Memory baseline of the App Service is 50%
Scale out 1 instance when avg(memory) > 80%
Scale in 1 instance when avg(memory) < 60%
Scaling out from 1 instance to 2 instances will work correctly when the average memory percentage exceeds 80%. But scaling in to 1 instance will never work because the memory baseline is too high.
After reading the Best Practices, my understanding is that when scaling in, autoscale estimates the resulting metric value and checks that no scale-out rule would be triggered; the estimate is roughly current average * current instance count / new instance count.
So if the average memory percentage drops to 50% across two instances, the scale-in rule is triggered, and autoscale estimates the resulting memory usage as 2 * 50% / 1 = 100%, which would of course trigger the scale-out rule, so it does not scale in.
It should however work when scaling from 3 to 2 instances: 3 * 50% / 2 = 75%, which is below the 80% of the scale-out rule.

I have the same issue here. My app needs only one instance, and I have an autoscale configuration like:
Scale out
When br-empresa (Average) CpuPercentage > 85 Increase instance count by 1
Or Br-Empresa (Average) MemoryPercentage > 85 Increase instance count by 1
Scale in
When br-empresa (Average) CpuPercentage <= 75 Decrease instance count by 1
And Br-Empresa (Average) MemoryPercentage <= 75 Decrease instance count by 1
And the baseline for memory is 60%.
The scale-out logic works fine, but the app never scales in, even when memory falls back to the 60% baseline: (60% * 2) / 1 = 120%.
For memory or CPU metrics, the current flapping estimate simply doesn't make sense.

Related

Grafana / Azure Monitor dimension filtering without splitting

I'm trying to display "Dependency duration" from Azure Monitor (Application Insights) in Grafana, and I want to exclude "Azure Service Bus" from the "dependency/type" dimension.
When I do this in Azure Monitor, I get all dependencies as a single value (screenshot omitted).
When I try to apply the same filter in Grafana, every dimension value gets split into its own series (screenshot omitted).
How can I avoid splitting the dimensions, or alternatively, how can I combine them back into one? The relevant part of the panel JSON is below. I tried removing "dimensionFilter": "*" but it did not change anything.
{
  "azureMonitor": {
    "dimensionFilter": "*",
    "dimensionFilters": [
      {
        "dimension": "dependency/type",
        "filter": "Azure Service Bus",
        "operator": "ne"
      }
    ],
    "dimensions": [
      {
        "text": "Dependency type",
        "value": "dependency/type"
      }
    ],
    "metricDefinition": "Microsoft.Insights/components",
    "metricName": "dependencies/duration",
    "metricNamespace": "microsoft.insights/components",
    "resourceGroup": "$resources_rg",
    "resourceName": "$app_insights"
  }
}

Slow Elasticsearch indexing using join datatype

We have an index with a join datatype and the indexing speed is very slow.
At best we are indexing 100 documents/sec, but mostly around 50/sec; the rate varies with document size. We use multiple threads with .NET NEST when indexing, but both batching and single inserts are quite slow. We have tested various batch sizes and still get no speed to speak of. Even with small documents containing only "metadata" it is slow, and the speed drops sharply as document size increases. Document size in this solution varies from small up to 6 MB.
What can we expect from indexing with the join datatype? How much of a penalty must we expect from using it? We did of course try to avoid it during design, but we did not find any way around it. Any tips or tricks?
We are using a 3-node cluster in Azure, each node with 32 GB of RAM and premium SSD disks. The Java heap size is set to 16 GB and swapping is disabled. Memory usage on the VMs is stable at about 60% of total, but CPU is very low (< 10%). We are running Elasticsearch 6.2.3.
A short version of the mapping:
"mappings": {
"log": {
"_routing": {
"required": true
},
"properties": {
"body": {
"type": "text"
},
"description": {
"type": "text"
},
"headStepJoinField": {
"type": "join",
"eager_global_ordinals": true,
"relations": {
"head": "step"
}
},
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"statusId": {
"type": "keyword"
},
"stepId": {
"type": "keyword"
}
}
}
}
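One thing I plan to benchmark: join fields default to eager_global_ordinals: true, which rebuilds global ordinals on every refresh and can throttle heavy indexing. Disabling it should shift that cost to the first parent/child query after a refresh instead (a sketch of the change, untested on our data):
"headStepJoinField": {
  "type": "join",
  "eager_global_ordinals": false,
  "relations": {
    "head": "step"
  }
}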

Azure SQL failover group, what does the grace period mean?

I am currently reading this: https://learn.microsoft.com/en-us/azure/sql-database/sql-database-auto-failover-group, and I have a hard time understanding the automatic failover policy:
By default, a failover group is configured with an automatic failover
policy. The SQL Database service triggers failover after the failure
is detected and the grace period has expired. The system must verify
that the outage cannot be mitigated by the built-in high availability
infrastructure of the SQL Database service due to the scale of the
impact. If you want to control the failover workflow from the
application, you can turn off automatic failover.
When defining the failover group in an ARM template:
{
  "condition": "[equals(parameters('redundancyId'), 'pri')]",
  "type": "Microsoft.Sql/servers",
  "kind": "v12.0",
  "name": "[variables('sqlServerPrimaryName')]",
  "apiVersion": "2014-04-01-preview",
  "location": "[parameters('location')]",
  "properties": {
    "administratorLogin": "[parameters('sqlServerPrimaryAdminUsername')]",
    "administratorLoginPassword": "[parameters('sqlServerPrimaryAdminPassword')]",
    "version": "12.0"
  },
  "resources": [
    {
      "condition": "[equals(parameters('redundancyId'), 'pri')]",
      "apiVersion": "2015-05-01-preview",
      "type": "failoverGroups",
      "name": "[variables('sqlFailoverGroupName')]",
      "properties": {
        "serverName": "[variables('sqlServerPrimaryName')]",
        "partnerServers": [
          {
            "id": "[resourceId('Microsoft.Sql/servers/', variables('sqlServerSecondaryName'))]"
          }
        ],
        "readWriteEndpoint": {
          "failoverPolicy": "Automatic",
          "failoverWithDataLossGracePeriodMinutes": 60
        },
        "readOnlyEndpoint": {
          "failoverPolicy": "Disabled"
        },
        "databases": [
          "[resourceId('Microsoft.Sql/servers/databases', variables('sqlServerPrimaryName'), variables('sqlDatabaseName'))]"
        ]
      },
      "dependsOn": [
        "[variables('sqlServerPrimaryName')]",
        "[resourceId('Microsoft.Sql/servers/databases', variables('sqlServerPrimaryName'), variables('sqlDatabaseName'))]",
        "[resourceId('Microsoft.Sql/servers', variables('sqlServerSecondaryName'))]"
      ]
    },
    {
      "condition": "[equals(parameters('redundancyId'), 'pri')]",
      "name": "[variables('sqlDatabaseName')]",
      "type": "databases",
      "apiVersion": "2014-04-01-preview",
      "location": "[parameters('location')]",
      "dependsOn": [
        "[variables('sqlServerPrimaryName')]"
      ],
      "properties": {
        "edition": "[variables('sqlDatabaseEdition')]",
        "requestedServiceObjectiveName": "[variables('sqlDatabaseServiceObjective')]"
      }
    }
  ]
},
{
  "condition": "[equals(parameters('redundancyId'), 'pri')]",
  "type": "Microsoft.Sql/servers",
  "kind": "v12.0",
  "name": "[variables('sqlServerSecondaryName')]",
  "apiVersion": "2014-04-01-preview",
  "location": "[variables('sqlServerSecondaryRegion')]",
  "properties": {
    "administratorLogin": "[parameters('sqlServerSecondaryAdminUsername')]",
    "administratorLoginPassword": "[parameters('sqlServerSecondaryAdminPassword')]",
    "version": "12.0"
  }
}
I specify the readWriteEndpoint like this:
"readWriteEndpoint": {
"failoverPolicy": "Automatic",
"failoverWithDataLossGracePeriodMinutes": 60
}
With a failoverWithDataLossGracePeriodMinutes set to 60 minutes.
What does this mean? I cannot find a clear answer anywhere. Does it mean that:
When an outage happens in my primary region, where my primary database resides, the read/write endpoint keeps pointing to the primary, and only after 60 minutes does it fail over to my secondary, which becomes the new primary. During those 60 minutes, the only way to read my data is to use the readOnlyEndpoint directly? OR
My read/write endpoint is switched over immediately, if the service can somehow detect that there is no data left to sync?
I think it boils down to: do I have to trigger the failover manually if I detect an outage, don't care about data loss, but want to be able to write to my database?
Bonus question: is the grace period there because there can be unsynced data on the primary that will be overwritten, or tossed away, if the secondary becomes the new primary (e.g. if I switch manually)?
Sorry, I can't keep it to only one question. I have read a lot and I really need to know this.
What does this mean?
It means that:
"when a outage is happening in my primary region where my primary database resides, the read/write endpoint points to the primary and only after 60 minutes it fails over to my secondary, which becomes the new primary. "
It can't failover automatically even when the data is synced because the high-availability solution in the primary region is trying to do the same thing, and almost all of the time your primary database will come back quickly in the primary region. Performing an automatic cross-region fail-over would interfere with this.
And
"the reason why the grace period is present, is that because the there can be unsynced data on the primary, that will be overwritten, or tossed away, if the secondary becomes the new primary"
And to allow time for the database to failover within the primary region.
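And if you want to control the failover workflow yourself instead of waiting out the grace period, the quoted documentation notes you can turn the automatic policy off; in the same template that is just (a sketch; the grace-period property is then no longer required):
"readWriteEndpoint": {
  "failoverPolicy": "Manual"
}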

Azure Web App not autoscaling in time

UPDATE: It appears to be working now.
I have configured our Azure Web App to scale between 4 and 10 instances, scaling out when CPU load is over 80% and back in when it is under 60%.
Our site has now been at over 95% CPU load for over two hours, and no autoscaling has occurred.
When looking at the "schedule and performance rules", I see that the Duration (minutes) is 300.
I feel this should be 10 minutes instead, but when I set it and save (with values that pass validation) I get an error (screenshot omitted).
Have I done something wrong, or is there a bug in the portal?
After I manually increased the count to 5 and then decreased it back to 4, I can see in the management services log that autoscaling is working:
ActiveAutoscaleProfile: {
  "Name": "Default",
  "Capacity": {
    "Minimum": "2",
    "Maximum": "10",
    "Default": "2"
  },
  "Rules": [
    {
      "MetricTrigger": {
        "Name": "CpuPercentage",
        "Namespace": "",
        "Resource": "xxx",
        "ResourceLocation": "West Europe",
        "TimeGrain": "PT1H",
        "Statistic": "Average",
        "TimeWindow": "PT5H",
        "TimeAggregation": "Average",
        "Operator": "GreaterThanOrEqual",
        "Threshold": 80.0,
        "Source": "xxx"
      },
      "ScaleAction": {
        "Direction": "Increase",
        "Type": "ChangeCount",
        "Value": "1",
        "Cooldown": "PT5M"
      }
    },
    {
      "MetricTrigger": {
        "Name": "CpuPercentage",
        "Namespace": "",
        "Resource": "xxx",
        "ResourceLocation": "West Europe",
        "TimeGrain": "PT1H",
        "Statistic": "Average",
        "TimeWindow": "PT5H",
        "TimeAggregation": "Average",
        "Operator": "LessThan",
        "Threshold": 60.0,
        "Source": "xxx"
      },
      "ScaleAction": {
        "Direction": "Decrease",
        "Type": "ChangeCount",
        "Value": "1",
        "Cooldown": "PT5M"
      }
    }
  ]
}
Description: The autoscale engine attempting to scale resource xxx' from 3 instances count to 2 instances count.
LastScaleActionTime: Wed, 03 Jun 2015 09:11:38 GMT
Microsoft.Resources/EventNameV2: Autoscale a resource.
Microsoft.Resources/Operation: Scale down
Microsoft.Resources/ResourceUri: /xxx
NewInstancesCount: 2
OldInstancesCount: 3
ResourceName: xxx
So I can see that autoscaling does indeed work.
Can the value be changed programmatically?
This appears to be a bug in the preview portal. I gave feedback on this here if you want to vote it up.
The problem has to do with the TimeGrain property that exists in the MetricTrigger as part of the autoscale rule. It appears the preview portal defaults this value to 1 hour ("PT1H") with no way to change it. This prevents you from setting a Duration in the portal to a value less than 60 minutes.
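For comparison, a trigger that validates with a 10-minute duration would look something like this (values are illustrative, in the same shape as the log output above):
"MetricTrigger": {
  "Name": "CpuPercentage",
  "Namespace": "",
  "Resource": "xxx",
  "ResourceLocation": "West Europe",
  "TimeGrain": "PT1M",
  "Statistic": "Average",
  "TimeWindow": "PT10M",
  "TimeAggregation": "Average",
  "Operator": "GreaterThanOrEqual",
  "Threshold": 80.0,
  "Source": "xxx"
}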
As a workaround, if you use the current portal at https://manage.windowsazure.com and configure autoscale by CPU for your web app there, then return to the preview portal, you will be able to set your Duration as low as 5 minutes.
Finally, to answer your question about setting this programmatically: yes, it is possible using the management libraries. I show how to do this here for a cloud service, and it should be the same (or very similar) for web apps. That post is over a year old, so it may not work 100% as written, but the MetricTrigger class is still basically the same, and that is where most of your attention will go.

Node.js memory leak, despite constant Heap + RSS sizes

According to my server monitoring, my memory usage is creeping up over time (chart omitted).
After ~4 weeks of uptime, it ends up causing problems / crashing (which makes sense, given that I'm on EC2 with m1.large instances => 8 GB RAM, and RAM usage seems to be increasing at about 1.5 GB/week).
If I restart my node.js app, the memory usage resets.
Yet... I'm keeping track of my memory usage via process.memoryUsage(), and even after ~1 week, I'm seeing
{"rss":"693 Mb","heapTotal":"120 Mb","heapUsed":"79 Mb"}
What am I missing? Clearly the leak is in Node, yet the process itself seems not to be aware of it...
You can try the node-memwatch module, which helps with leak detection and heap diffing in Node.
A heap diff would look similar to this:
{
  "before": { "nodes": 11625, "size_bytes": 1869904, "size": "1.78 mb" },
  "after": { "nodes": 21435, "size_bytes": 2119136, "size": "2.02 mb" },
  "change": {
    "size_bytes": 249232,
    "size": "243.39 kb",
    "freed_nodes": 197,
    "allocated_nodes": 10007,
    "details": [
      { "what": "String", "size_bytes": -2120, "size": "-2.07 kb", "+": 3, "-": 62 },
      { "what": "Array", "size_bytes": 66687, "size": "65.13 kb", "+": 4, "-": 78 },
      { "what": "LeakingClass", "size_bytes": 239952, "size": "234.33 kb", "+": 9998, "-": 0 }
    ]
  }
}
