UPDATE: It appears to be working now.
I have configured our Azure Web App to scale between 4 and 10 instances, scaling out when CPU load is over 80% and back in when it is under 60%.
Our site has now been at over 95% CPU load for over two hours and no autoscaling has occurred.
When looking at the "schedule and performance rules" I see that the Duration (minutes) is 300.
I feel that this should be 10 minutes instead, but when I set it and save (with validation passing) I get this error:
Have I done something wrong or is there a bug in the portal?
After I manually increased to 5 instances and then decreased back to 4, I can see in the management services log that autoscaling is working:
ActiveAutoscaleProfile: {
"Name": "Default",
"Capacity": {
"Minimum": "2",
"Maximum": "10",
"Default": "2"
},
"Rules": [
{
"MetricTrigger": {
"Name": "CpuPercentage",
"Namespace": "",
"Resource": "xxx",
"ResourceLocation": "West Europe",
"TimeGrain": "PT1H",
"Statistic": "Average",
"TimeWindow": "PT5H",
"TimeAggregation": "Average",
"Operator": "GreaterThanOrEqual",
"Threshold": 80.0,
"Source": "xxx"
},
"ScaleAction": {
"Direction": "Increase",
"Type": "ChangeCount",
"Value": "1",
"Cooldown": "PT5M"
}
},
{
"MetricTrigger": {
"Name": "CpuPercentage",
"Namespace": "",
"Resource": "xxx",
"ResourceLocation": "West Europe",
"TimeGrain": "PT1H",
"Statistic": "Average",
"TimeWindow": "PT5H",
"TimeAggregation": "Average",
"Operator": "LessThan",
"Threshold": 60.0,
"Source": "xxx"
},
"ScaleAction": {
"Direction": "Decrease",
"Type": "ChangeCount",
"Value": "1",
"Cooldown": "PT5M"
}
} ] }
Description: The autoscale engine attempting to scale resource xxx' from 3 instances count to 2 instances count.
LastScaleActionTime: Wed, 03 Jun 2015 09:11:38 GMT
Microsoft.Resources/EventNameV2: Autoscale a resource.
Microsoft.Resources/Operation: Scale down
Microsoft.Resources/ResourceUri: /xxx
NewInstancesCount: 2
OldInstancesCount: 3
ResourceName: xxx
So I can see that autoscaling does indeed work.
Can the value be changed programmatically?
This appears to be a bug in the preview portal. I gave feedback on this here if you want to vote it up.
The problem has to do with the TimeGrain property that exists in the MetricTrigger as part of the autoscale rule. It appears the preview portal defaults this value to 1 hour ("PT1H") with no way to change it. This prevents you from setting a Duration in the portal to a value less than 60 minutes.
As a workaround, if you use the current portal at https://manage.windowsazure.com and configure autoscale by CPU for your web app there, then return to the preview portal, you will be able to set your Duration to as low as 5 minutes.
Finally, to answer your question about setting this programmatically: yes, this is possible using the management libraries. I show how to do this here for a cloud service, but it should be the same (or very similar) for web apps. That post is over a year old, so it may not work exactly as written, but the MetricTrigger class is still basically the same, and that is where most of your attention will be.
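For reference, here is a minimal sketch of the same idea using today's azure-mgmt-monitor Python package rather than the older .NET management libraries from that post. All names in angle brackets are placeholders, and the exact model/property names may differ slightly between SDK versions:

# Sketch only: assumes the azure-identity and azure-mgmt-monitor packages.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

client = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Fetch the existing autoscale setting, shrink TimeGrain/TimeWindow on every
# rule's MetricTrigger, and write the setting back.
setting = client.autoscale_settings.get("<resource-group>", "<autoscale-setting>")
for profile in setting.profiles:
    for rule in profile.rules:
        rule.metric_trigger.time_grain = timedelta(minutes=1)    # was PT1H
        rule.metric_trigger.time_window = timedelta(minutes=10)  # was PT5H

client.autoscale_settings.create_or_update(
    "<resource-group>", "<autoscale-setting>", setting
)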
Related
I'm trying to display "Dependency duration" from Azure Monitor (Application Insights) in Grafana. I want to exclude "Azure Service Bus" from the "dependency/type" dimension.
When I do this in Azure Monitor, I get all dependencies as a single value:
When I try to apply the same filter in Grafana, the dimensions get split:
How can I avoid splitting the dimensions, or alternatively, how can I combine them back into one? The relevant part of the code is below. I tried removing "dimensionFilter": "*" but it did not change anything.
{
"azureMonitor": {
"dimensionFilter": "*",
"dimensionFilters": [
{
"dimension": "dependency/type",
"filter": "Azure Service Bus",
"operator": "ne"
}
],
"dimensions": [
{
"text": "Dependency type",
"value": "dependency/type"
}
],
"metricDefinition": "Microsoft.Insights/components",
"metricName": "dependencies/duration",
"metricNamespace": "microsoft.insights/components",
"resourceGroup": "$resources_rg",
"resourceName": "$app_insights"
}
}
I am currently reading this: https://learn.microsoft.com/en-us/azure/sql-database/sql-database-auto-failover-group, and I have a hard time understanding the automatic failover policy:
By default, a failover group is configured with an automatic failover
policy. The SQL Database service triggers failover after the failure
is detected and the grace period has expired. The system must verify
that the outage cannot be mitigated by the built-in high availability
infrastructure of the SQL Database service due to the scale of the
impact. If you want to control the failover workflow from the
application, you can turn off automatic failover.
When defining the failover group in an ARM template:
{
"condition": "[equals(parameters('redundancyId'), 'pri')]",
"type": "Microsoft.Sql/servers",
"kind": "v12.0",
"name": "[variables('sqlServerPrimaryName')]",
"apiVersion": "2014-04-01-preview",
"location": "[parameters('location')]",
"properties": {
"administratorLogin": "[parameters('sqlServerPrimaryAdminUsername')]",
"administratorLoginPassword": "[parameters('sqlServerPrimaryAdminPassword')]",
"version": "12.0"
},
"resources": [
{
"condition": "[equals(parameters('redundancyId'), 'pri')]",
"apiVersion": "2015-05-01-preview",
"type": "failoverGroups",
"name": "[variables('sqlFailoverGroupName')]",
"properties": {
"serverName": "[variables('sqlServerPrimaryName')]",
"partnerServers": [
{
"id": "[resourceId('Microsoft.Sql/servers/', variables('sqlServerSecondaryName'))]"
}
],
"readWriteEndpoint": {
"failoverPolicy": "Automatic",
"failoverWithDataLossGracePeriodMinutes": 60
},
"readOnlyEndpoint": {
"failoverPolicy": "Disabled"
},
"databases": [
"[resourceId('Microsoft.Sql/servers/databases', variables('sqlServerPrimaryName'), variables('sqlDatabaseName'))]"
]
},
"dependsOn": [
"[variables('sqlServerPrimaryName')]",
"[resourceId('Microsoft.Sql/servers/databases', variables('sqlServerPrimaryName'), variables('sqlDatabaseName'))]",
"[resourceId('Microsoft.Sql/servers', variables('sqlServerSecondaryName'))]"
]
},
{
"condition": "[equals(parameters('redundancyId'), 'pri')]",
"name": "[variables('sqlDatabaseName')]",
"type": "databases",
"apiVersion": "2014-04-01-preview",
"location": "[parameters('location')]",
"dependsOn": [
"[variables('sqlServerPrimaryName')]"
],
"properties": {
"edition": "[variables('sqlDatabaseEdition')]",
"requestedServiceObjectiveName": "[variables('sqlDatabaseServiceObjective')]"
}
}
]
},
{
"condition": "[equals(parameters('redundancyId'), 'pri')]",
"type": "Microsoft.Sql/servers",
"kind": "v12.0",
"name": "[variables('sqlServerSecondaryName')]",
"apiVersion": "2014-04-01-preview",
"location": "[variables('sqlServerSecondaryRegion')]",
"properties": {
"administratorLogin": "[parameters('sqlServerSecondaryAdminUsername')]",
"administratorLoginPassword": "[parameters('sqlServerSecondaryAdminPassword')]",
"version": "12.0"
}
}
I specify the readWriteEndpoint like this:
"readWriteEndpoint": {
"failoverPolicy": "Automatic",
"failoverWithDataLossGracePeriodMinutes": 60
}
With a failoverWithDataLossGracePeriodMinutes set to 60 minutes.
What does this mean? I cannot find a clear answer anywhere. Does it mean that:
When an outage happens in my primary region, where my primary database resides, the read/write endpoint keeps pointing to the primary, and only after 60 minutes does it fail over to my secondary, which becomes the new primary. During those 60 minutes, the only way to read my data is to use the readOnlyEndpoint directly? OR
My read/write endpoint is switched over instantly if the service can somehow detect that there is no data left to be synced?
I think it boils down to this: do I have to trigger the failover manually when I detect an outage, if I don't care about data loss but want to keep being able to write to my database?
Bonus question: is the grace period there because there can be unsynced data on the primary that will be overwritten or tossed away if the secondary becomes the new primary (if I switch manually)?
Sorry, I can't keep it to only one question. I have read a lot and I really need to know this.
What does this mean?
It means that:
"when a outage is happening in my primary region where my primary database resides, the read/write endpoint points to the primary and only after 60 minutes it fails over to my secondary, which becomes the new primary. "
It can't fail over automatically, even when the data is synced, because the high-availability infrastructure within the primary region is trying to do the same thing, and almost all of the time your primary database will come back quickly in the primary region. Performing an automatic cross-region failover would interfere with this.
And
"the reason why the grace period is present, is that because the there can be unsynced data on the primary, that will be overwritten, or tossed away, if the secondary becomes the new primary"
And to allow time for the database to failover within the primary region.
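If you do decide to control the failover yourself, it can be triggered programmatically. Here is a minimal sketch, assuming the azure-identity and azure-mgmt-sql Python packages; names in angle brackets are placeholders, and the operation is issued against the secondary server, which becomes the new primary:

# Sketch only: method names follow the current azure-mgmt-sql package and
# may differ between SDK versions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Planned failover: waits for full synchronization, so no data loss.
client.failover_groups.begin_failover(
    "<resource-group>", "<secondary-server>", "<failover-group>"
).result()

# Forced failover: does not wait for synchronization and may discard
# unsynced transactions on the old primary.
# client.failover_groups.begin_force_failover_allow_data_loss(
#     "<resource-group>", "<secondary-server>", "<failover-group>"
# ).result()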
We have some Azure Table storage tables in our subscription and would like to migrate them to the Cosmos DB Table API for performance reasons. To do this, I started creating a Cosmos DB account with the Table API selected, but my deployment failed with the following error. When I tried with the SQL API, it worked.
{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. "details":[{"code":"BadRequest","message":"{\r\n \"code\": \"BadRequest\",\r\n \"message\": \"CORS rules are not supported for this API\rMicrosoft.Azure.Documents.Common/2.1.0.0\"\r\n}"}]}
Can someone please let me know what could be the reason for this?
@AngiSen, this may be related to a recent (breaking) update of the Azure Cosmos DB resource provider (Microsoft.DocumentDb/databaseAccounts), as I just noticed today (28th of Nov 2018) that a previously working deployment (as of 23rd of Nov 2018) of Cosmos DB Table API is now failing with this same error:
9:16:23 AM - Resource Microsoft.DocumentDb/databaseAccounts 'xxx-xxx-xxx' failed with message '{
"code": "BadRequest",
"message": "CORS rules are not supported for this API\r\nActivityId: xxx, Microsoft.Azure.Documents.Common/2.1.0.0"
}'
In my case I'm using the 2015-04-08 API version with the Table API, but I don't configure the CORS part explicitly, and in any case there's no such configuration option in the resource provider.
Digging into the existing Cosmos DB instance with https://resources.azure.com shows there's indeed a CORS member that is part of the definition:
{
"id": "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.DocumentDB/databaseAccounts/xxx",
"name": "xxx",
"location": "North Europe",
"type": "Microsoft.DocumentDB/databaseAccounts",
"kind": "GlobalDocumentDB",
"tags": {},
"properties": {
"provisioningState": "Succeeded",
"documentEndpoint": "https://xxx.documents.azure.com:443/",
"tableEndpoint": "https://xxx.table.cosmosdb.azure.com:443/",
"ipRangeFilter": "",
"enableAutomaticFailover": false,
"enableMultipleWriteLocations": false,
"isVirtualNetworkFilterEnabled": false,
"virtualNetworkRules": [],
"EnabledApiTypes": "Table, Sql",
"databaseAccountOfferType": "Standard",
"consistencyPolicy": {
"defaultConsistencyLevel": "BoundedStaleness",
"maxIntervalInSeconds": 86400,
"maxStalenessPrefix": 1000000
},
"configurationOverrides": {},
"writeLocations": [
{
"id": "xxx-northeurope",
"locationName": "North Europe",
"documentEndpoint": "https://xxx-northeurope.documents.azure.com:443/",
"provisioningState": "Succeeded",
"failoverPriority": 0
}
],
"readLocations": [
{
"id": "xxx-northeurope",
"locationName": "North Europe",
"documentEndpoint": "https://xxx-northeurope.documents.azure.com:443/",
"provisioningState": "Succeeded",
"failoverPriority": 0
}
],
"locations": [
{
"id": "xxx-northeurope",
"locationName": "North Europe",
"documentEndpoint": "https://xxx-northeurope.documents.azure.com:443/",
"provisioningState": "Succeeded",
"failoverPriority": 0
}
],
"failoverPolicies": [
{
"id": "xxx-northeurope",
"locationName": "North Europe",
"failoverPriority": 0
}
],
"cors": [],
"capabilities": [
{
"name": "EnableTable"
}
]
}
}
Hope it'll get fixed quickly if it's indeed a breaking change...
Wanted to make an official statement here. I have spoken with the Cosmos DB team and they have a fix ready and it should be deployed tonight. Please let me know if you should have any questions. Thank you for posting the issue.
My App Service has failed to scale in after scaling out. This seems to be a pattern I've been trying to troubleshoot for several months.
I've tried the following but none have worked:
My scale condition is based on CPU and memory. However, I've never seen CPU go past 12%, so I'm assuming it's actually scaling based on memory.
Set the scale-out condition to memory over 90% over a 5-minute average with a 10-minute cooldown, and the scale-in condition to memory under 70% over a 5-minute average. This doesn't seem to make sense, since if my memory utilization is already at 90%, I probably have an underlying memory leak and should have scaled out long before.
Set the scale-out condition to memory over 80% over a 60-minute average with a 10-minute cooldown, and the scale-in condition to memory under 60% over a 5-minute average. This makes more sense, as I've seen memory usage burst over a few hours only to drop.
Expected behavior: App Service autoscaling will reduce the instance count after 5 minutes of memory usage below 60%.
Question:
What is the ideal threshold on a metric for scaling smoothly if my baseline CPU stays at roughly 6% on average and memory at 53%? In other words, what are the best minimum values to scale in on and the best maximum values to scale out on without worrying about anti-patterns such as flapping? A larger threshold difference of 20% makes more sense to me.
Alternative solution:
Given the amount of troubleshooting involved in what's marketed as simple "push button scaling", it's almost not worth the headache of the configuration vagueness (you can't even use IIS metrics like connection count without a custom PowerShell script!). I'm considering disabling autoscaling because of its unpredictability and just keeping 2 instances running for automatic load balancing, scaling manually.
Autoscale Configuration:
{
"location": "East US 2",
"tags": {
"$type": "Microsoft.WindowsAzure.Management.Common.Storage.CasePreservedDictionary, Microsoft.WindowsAzure.Management.Common.Storage"
},
"properties": {
"name": "CPU and Memory Autoscale",
"enabled": true,
"targetResourceUri": "/redacted",
"profiles": [
{
"name": "Auto created scale condition",
"capacity": {
"minimum": "1",
"maximum": "10",
"default": "1"
},
"rules": [
{
"scaleAction": {
"direction": "Increase",
"type": "ChangeCount",
"value": "1",
"cooldown": "PT10M"
},
"metricTrigger": {
"metricName": "MemoryPercentage",
"metricNamespace": "",
"metricResourceUri": "/redacted",
"operator": "GreaterThanOrEqual",
"statistic": "Average",
"threshold": 80,
"timeAggregation": "Average",
"timeGrain": "PT1M",
"timeWindow": "PT1H"
}
},
{
"scaleAction": {
"direction": "Decrease",
"type": "ChangeCount",
"value": "1",
"cooldown": "PT5M"
},
"metricTrigger": {
"metricName": "MemoryPercentage",
"metricNamespace": "",
"metricResourceUri": "/redacted",
"operator": "LessThanOrEqual",
"statistic": "Average",
"threshold": 60,
"timeAggregation": "Average",
"timeGrain": "PT1M",
"timeWindow": "PT10M"
}
},
{
"scaleAction": {
"direction": "Increase",
"type": "ChangeCount",
"value": "1",
"cooldown": "PT5M"
},
"metricTrigger": {
"metricName": "CpuPercentage",
"metricNamespace": "",
"metricResourceUri": "/redacted",
"operator": "GreaterThanOrEqual",
"statistic": "Average",
"threshold": 60,
"timeAggregation": "Average",
"timeGrain": "PT1M",
"timeWindow": "PT1H"
}
},
{
"scaleAction": {
"direction": "Decrease",
"type": "ChangeCount",
"value": "1",
"cooldown": "PT5M"
},
"metricTrigger": {
"metricName": "CpuPercentage",
"metricNamespace": "",
"metricResourceUri": "/redacted",
"operator": "LessThanOrEqual",
"statistic": "Average",
"threshold": 40,
"timeAggregation": "Average",
"timeGrain": "PT1M",
"timeWindow": "PT10M"
}
}
]
}
],
"notifications": [
{
"operation": "Scale",
"email": {
"sendToSubscriptionAdministrator": false,
"sendToSubscriptionCoAdministrators": false,
"customEmails": [
"redacted"
]
},
"webhooks": []
}
],
"targetResourceLocation": "East US 2"
},
"id": "/redacted",
"name": "CPU and Memory Autoscale",
"type": "Microsoft.Insights/autoscaleSettings"
}
For the CpuPercentage metric you have a scale-out action when it goes above 60 and a scale-in action when it goes below 40, and the difference between the two is very small. This can cause a behavior described as flapping, which prevents autoscale's scale-in action from kicking in. The MemoryPercentage rule you have configured has the same issue.
You should have a difference of at least 40 between your scale-out and scale-in thresholds to avoid flapping. More details on flapping are at https://learn.microsoft.com/en-us/azure/monitoring-and-diagnostics/insights-autoscale-best-practices#choose-the-thresholds-carefully-for-all-metric-types (search for the word "flapping").
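To see why the gap needs to be that large, here is a rough sketch (plain Python, function name is mine) of the constraint that the scale-in estimate described in that best-practices article imposes:

def max_safe_scale_in_threshold(scale_out_threshold: float,
                                instance_count: int) -> float:
    """Largest scale-in threshold that cannot flap when dropping from
    instance_count to instance_count - 1 instances: the estimated
    post-scale-in utilization (avg * n / (n - 1)) must stay below the
    scale-out threshold."""
    n = instance_count
    return scale_out_threshold * (n - 1) / n

# With scale-out at 60% and 2 instances, the scale-in threshold must be
# below 30%; the configured 40% can therefore never produce a 2 -> 1
# scale-in that survives the engine's re-check.
print(max_safe_scale_in_threshold(60, 2))  # 30.0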
I have exactly the same problem, and I've come to believe that autoscaling back to one instance the way we want to is currently not possible.
My current workaround is to scale in to 1 instance with a second profile that repeats every day between 23:55 and 00:00.
Just to reiterate the problem. I have the following scenario. It is basically identical to yours.
Memory baseline of the App Service is 50%
Scale out 1 instance when avg(memory) > 80%
Scale in 1 instance when avg(memory) < 60%
Scaling out from 1 instance to 2 instances will work correctly when the average memory percentage exceeds 80%. But scaling in to 1 instance will never work because the memory baseline is too high.
After reading the Best Practices, my understanding is that when scaling in, the engine estimates the resulting memory percentage and checks that no scale-out rule would be triggered.
So if the average memory percentage drops to 50% with two instances, the scale-in rule is triggered, and the engine estimates the resulting memory usage to be 2 * 50% / 1 = 100%, which of course would trigger the scale-out rule, so it will not scale in.
It should, however, work when scaling from 3 to 2 instances: 3 * 50% / 2 = 75%, which is below the 80% of the scale-out rule.
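That estimate is easy to reproduce; here is a small sketch of the arithmetic above (plain Python, function name is mine):

def scale_in_survives_recheck(avg_utilization: float, current_count: int,
                              scale_out_threshold: float) -> bool:
    """Estimate per-instance utilization after removing one instance and
    check that it would not immediately re-trigger the scale-out rule."""
    estimated = avg_utilization * current_count / (current_count - 1)
    return estimated < scale_out_threshold

print(scale_in_survives_recheck(50, 2, 80))  # False: 2 * 50 / 1 = 100 >= 80
print(scale_in_survives_recheck(50, 3, 80))  # True:  3 * 50 / 2 = 75  < 80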
I have the same issue here. My app needs only one instance, and I have an autoscaling configuration like this:
Scale out
When br-empresa (Average) CpuPercentage > 85 Increase instance count by 1
Or Br-Empresa (Average) MemoryPercentage > 85 Increase instance count by 1
Scale in
When br-empresa (Average) CpuPercentage <= 75 Decrease instance count by 1
And Br-Empresa (Average) MemoryPercentage <= 75 Decrease instance count by 1
And the baseline for memory is 60%.
The scale-out logic works fine, but the app never scales in, even when memory falls back to 60%: (60% * 2) / 1 = 120%.
For memory or CPU metrics, the flapping estimate as currently implemented just doesn't make sense.
I have created an Activity Log Alert in Azure that does a custom log search against an Application Insights instance.
The alert is working, and the action groups are notified through the channels I have set up.
The problem I'm having is creating that alert in the ARM template we use to deploy the resources.
When looking at the automation script in the portal, these alerts (microsoft.insights/scheduledqueryrules) are left out and not visible.
I can't find any information online on how to write the condition in the template so that it works with a custom log search.
Any suggestions on where to find info on how to write the condition, or on how to extract the template from the portal for these alerts?
This is an ARM template part that creates an alert with a scheduled query. It also adds an array of action groups that get notified when the alert is triggered:
{
"name": "[parameters('scheduleQueryMonitorApplicationError')]",
"type": "microsoft.insights/scheduledqueryrules",
"apiVersion": "2018-04-16",
"location": "[resourceGroup().location]",
"tags": {
"[concat('hidden-link:', resourceGroup().id, '/resourceGroups/', parameters('resourceGroupName'), '/providers/microsoft.insights/components/', parameters('applicationInsightsName'))]": "Resource"
},
"properties": {
"description": "[parameters('scheduleQueryMonitorApplicationError')]",
"enabled": "true",
"source": {
"query": "traces | where severityLevel == 3",
"queryType": "ResultCount",
"dataSourceId": "[resourceId('microsoft.insights/components', parameters('applicationInsightsName'))]"
},
"schedule": {
"frequencyInMinutes": 5,
"timeWindowInMinutes": 5
},
"action": {
"odata.type": "Microsoft.WindowsAzure.Management.Monitoring.Alerts.Models.Microsoft.AppInsights.Nexus.DataContracts.Resources.ScheduledQueryRules.AlertingAction",
"severity": "3",
"aznsAction": {
"actionGroup": "[array( resourceId('microsoft.insights/actiongroups', parameters('actionGroupName')) )]"
},
"trigger": {
"threshold": 1,
"thresholdOperator": "GreaterThan"
}
}
},
"dependsOn": [
"[resourceId('microsoft.insights/components', parameters('applicationInsightsName'))]"
]
},
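In case it helps, this is one way to deploy a template containing that resource: a sketch assuming the azure-identity and azure-mgmt-resource Python packages, with placeholder names and parameter values:

# Sketch only: assumes alert-template.json is a full template that embeds
# the scheduledqueryrules resource shown above.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

with open("alert-template.json") as f:
    template = json.load(f)

client.deployments.begin_create_or_update(
    "<resource-group>",
    "scheduled-query-alert",
    {
        "properties": {
            "mode": "Incremental",
            "template": template,
            "parameters": {
                "scheduleQueryMonitorApplicationError": {"value": "app-error-alert"},
                "applicationInsightsName": {"value": "<app-insights-name>"},
                "actionGroupName": {"value": "<action-group-name>"},
            },
        }
    },
).result()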
Please see this Stack Overflow thread, where a similar question was asked. Elfocrash mentions that he wrote a blog post about it, explaining how it works. I tried his method and it works.