Unable to scale Azure VMSS beyond 15 instances - azure

We have a few Azure VMSS deployments in our subscription, created over the course of a couple of years. The earliest one we created can scale up to about 50 instances, but the one created a month ago is unable to scale past 15 instances. When I try to do so, I get:
Failed to update autoscale configuration for 't0'. {
    "error": {
        "details": [],
        "code": "InboundNatPoolFrontendPortRangeSmallerThanRequestedPorts",
        "message": "The frontend port range for the inboundNATpool /subscriptions/a48dea64-5847-4d79-aaa6-036530430809/resourceGroups/int-aowappcompat/providers/Microsoft.Network/loadBalancers/LB-int-aowappcompat-t0/inboundNatPools/EtwListenerNatPool-qs8az5dmgu is smaller than the requested number of ports 28 in VM scale set /subscriptions/a48dea64-5847-4d79-aaa6-036530430809/resourceGroups/INT-AOWAPPCOMPAT/providers/Microsoft.Compute/virtualMachineScaleSets/t0."
    }
}.
I've tried to find an answer to how to fix this, but there's virtually nothing out there for InboundNatPoolFrontendPortRangeSmallerThanRequestedPorts except for an unhelpful StackOverflow answer. We've gone through the ARM template and all the various UIs for the load balancers, public IP addresses, etc., and diff'd the old and new ARM templates trying to find the source of the disparity, with no luck. Unfortunately, I'm not a networking whiz, so my knowledge here is fairly shallow.
UPDATE: Here's a (maybe?) relevant snippet from my template:
"inboundNatPools": [{
"name": "LoadBalancerBEAddressNatPool",
"properties": {
"backendPort": "3389",
"frontendIPConfiguration": {
"id": "[variables('lbIPConfig0')]"
},
"frontendPortRangeEnd": "4500",
"frontendPortRangeStart": "3389",
"protocol": "tcp"
}
}
]
I believe this is not a duplicate of the referenced question, as my problem doesn't seem to be an issue with the port range being too small. As you can see from the snippet, the range covers more than 1,100 ports, and I can't even scale to 16 instances. Similarly, comments about overprovisioning don't seem relevant either, as with that many ports I should be able to scale to 16 instances without issue. Of course, there may be something I'm not understanding here.
Any tips? Thanks!

Scale sets currently default to overprovisioning VMs, which can mean more NAT pool frontend ports are requested during a scale-out than the target instance count alone would need. I have faced the same issue. To avoid it, you could wait for the VM instances to reach a successfully provisioned, running status before you manually scale out further. Also, you should not exceed the core quota limits of your subscription.
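If overprovisioning is what pushes the requested port count past the NAT pool's range, one option is to switch it off on the scale set. Here is a minimal sketch of where that flag lives in an ARM template; it reuses the scale set name from the error message and omits every other required property (sku, virtualMachineProfile, and so on):
{
    "type": "Microsoft.Compute/virtualMachineScaleSets",
    "apiVersion": "2019-07-01",
    "name": "t0",
    "properties": {
        // disables the temporary extra VMs that each consume NAT pool frontend ports
        "overprovision": false
    }
}
Note also that the error refers to EtwListenerNatPool-qs8az5dmgu rather than the LoadBalancerBEAddressNatPool shown in the snippet above, so widening frontendPortRangeStart/frontendPortRangeEnd on that particular NAT pool may be the more direct fix.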

Related

Why doesn't Azure Functions use other instances in the app service plan to process data?

I have an Azure Functions durable task that fans out into 12 smaller tasks. I am using a Dedicated plan, my maxConcurrentActivityFunctions is currently set to 4, and I have a total of 3 instances (P3v2 - 4 cores) in the app service plan.
My understanding is that I should be able to process 12 concurrent tasks, and each instance should use all of its CPU to process the job, because the job is CPU-bound.
But in reality, scaling doesn't improve performance: all of the tasks go to a single instance. The other 2 instances just stay idle, even though the main instance is being hammered and its CPU usage sits at 100%.
I am sure the tasks go to the same instance because I can see that in Log Analytics. Every log entry has the same host instance ID, and if I filter that host instance ID out, no logs are returned at all.
I also tested making 3 separate calls with 4 tasks each. That doesn't seem to use 3 instances either. The app service plan metrics look as if only 1 instance can be online at a time, despite 3 instances being available. The dashed line seems to mean "offline", because when I filter by instance it just shows 0.
Here is the host.json file
{
    "version": "2.0",
    "functionTimeout": "01:00:00",
    "logging": {
        "logLevel": {
            "default": "Information"
        },
        "console": {
            "isEnabled": false
        },
        "applicationInsights": {
            "samplingSettings": {
                "excludedTypes": "Request",
                "isEnabled": true
            }
        }
    },
    "extensions": {
        "durableTask": {
            "hubName": "blah",
            "maxConcurrentActivityFunctions": 4,
            "maxConcurrentOrchestratorFunctions": 20
        }
    }
}
My expectation is that all 12 tasks should begin immediately, and all 3 instances should be busy processing the data, instead of only 1 instance running 4 concurrent tasks.
Am I doing anything wrong, or am I misunderstanding something here?
As far as I know, and per the Microsoft documentation, multiple applications in the same app service plan share all the instances in your Premium plan.
For example, if the app service plan is configured to run multiple VM instances, then all the apps in the plan can run across those instances.
In your case you have only one application, but that application has many sub-units (functions), so the application is using only one instance.
If you want to use all the instances, try deploying multiple function apps into the same app service plan.
Also, you can use the manual scaling functionality, or configure autoscale for the app service plan.
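For the autoscale route, here is a minimal sketch of an autoscale settings resource targeting the plan; the plan name 'my-plan', the setting name, and the fixed-capacity profile are illustrative assumptions rather than values from the question:
{
    "type": "Microsoft.Insights/autoscaleSettings",
    "apiVersion": "2015-04-01",
    "name": "plan-autoscale",
    "location": "[resourceGroup().location]",
    "properties": {
        "enabled": true,
        "targetResourceUri": "[resourceId('Microsoft.Web/serverfarms', 'my-plan')]",
        "profiles": [
            {
                "name": "fixed-capacity",
                "capacity": { "minimum": "1", "maximum": "3", "default": "3" },
                "rules": []
            }
        ]
    }
}
Keep in mind this only controls how many plan instances exist; it does not by itself change how the durable task hub assigns activity work to those instances.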

Blob-triggered Azure Function doesn't process only one blob at a time anymore

I have written a blob-triggered function that uploads data to a CosmosDB database using the Gremlin API, on Azure Functions version 2.0. Whenever the function is triggered, it reads the blob, extracts the relevant information, and then queries the database to upload the data.
However, when all the files are uploaded to blob storage at the same time, the function processes them all at the same time, which results in too many requests for the database to handle. To avoid this, I ensured that the function would only process one file at a time, by setting batchSize to 1 in the host.json file:
{
    "extensions": {
        "queues": {
            "batchSize": 1,
            "maxDequeueCount": 1,
            "newBatchThreshold": 0
        }
    },
    "logging": {
        "applicationInsights": {
            "samplingSettings": {
                "isEnabled": true,
                "excludedTypes": "Request"
            }
        }
    },
    "version": "2.0"
}
This worked perfectly fine for 20 files at a time.
Now we are trying to process 300 files at a time, and this no longer seems to work: the function processes all the files at the same time again, and the database cannot handle all the requests.
What am I missing here? Is there some scaling issue I'm not aware of?
From here:
If you want to avoid parallel execution for messages received on one queue, you can set batchSize to 1. However, this setting eliminates concurrency as long as your function app runs only on a single virtual machine (VM). If the function app scales out to multiple VMs, each VM could run one instance of each queue-triggered function.
You need to combine this with the app setting WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT when you run on the Consumption plan.
Or, according to the docs, the better way is through the function app property functionAppScaleLimit: https://learn.microsoft.com/en-us/azure/azure-functions/event-driven-scaling#limit-scale-out
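As a rough sketch only, both knobs can be pinned in the function app's ARM definition; the app name here is a placeholder, and required properties such as serverFarmId are omitted:
{
    "type": "Microsoft.Web/sites",
    "apiVersion": "2021-03-01",
    "name": "my-blob-function-app",
    "kind": "functionapp",
    "location": "[resourceGroup().location]",
    "properties": {
        "siteConfig": {
            // caps scale-out so batchSize: 1 applies to the whole app, not per VM
            "functionAppScaleLimit": 1,
            "appSettings": [
                { "name": "WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT", "value": "1" }
            ]
        }
    }
}
With scale-out capped at a single instance and batchSize set to 1, only one blob-queue message should be processed at a time.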
WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT would work of course.
You can also run multiple worker processes within one host: you then have fewer hosts and a higher FUNCTIONS_WORKER_PROCESS_COUNT per host. The cost implications depend on your plan.
Note that all workers within a host share resources, so this is recommended for more IO-bound workloads.

Error when creating AKS cluster using a reference for the subnet ID

I'm getting an error when I try to deploy an AKS cluster using an ARM template, if the vnetSubnetID in the agentPoolProfiles property is a reference. I've used this exact template before without problems (on October 4th), but now I'm seeing the error with multiple different clusters, whether I deploy through a VSTS pipeline or manually using PowerShell.
The property is set up like this:
"agentPoolProfiles": [
{
"name": "agentpool",
"count": "[parameters('agentCount')]",
"vmSize": "[parameters('agentVMSize')]",
"osType": "Linux",
"dnsPrefix": "[variables('agentsEndpointDNSNamePrefix')]",
"osDiskSizeGB": "[parameters('agentOsDiskSizeGB')]",
"vnetSubnetID": "[reference(concat('Microsoft.Network/virtualNetworks/', variables('vnetName'))).subnets[0].id]"
}
]
The variable 'vnetName' is based on an input parameter I use for the cluster name, and the vnet itself 100% exists, and is deployed as part of the same template.
If I try to deploy a new cluster I get the following error:
Message: {
    "code": "InvalidParameter",
    "message": "The value of parameter agentPoolProfile.vnetSubnetID is invalid.",
    "target": "agentPoolProfile.vnetSubnetID"
}
If I try to re-deploy a cluster, with no changes to the template or input parameters since it last worked, I get the following error:
Message: {
    "code": "PropertyChangeNotAllowed",
    "message": "Changing property 'agentPoolProfile.vnetSubnetID' is not allowed.",
    "target": "agentPoolProfile.vnetSubnetID"
}
Has something changed that means I can no longer get the vnet ID at runtime? Does it need to be passed in as a parameter now? If something has changed, is there anywhere I can find out the details?
Edit: Just to clarify, for re-deploying a cluster, I have checked and there are no new subnets, and I'm seeing the same behavior on 3 different clusters with different VNets.
Switching from reference() to resourceId() did fix the problem, so that has been marked as the answer, but I'm still no clearer on why reference() stopped working; I'll update here if I figure it out.
I think what happened is that subnets[0].id returned a different (wrong) subnet ID, and that is what the error points out: you cannot change the subnet ID after the cluster is deployed.
Probably somebody created a new subnet in the vnet, so subnets[0] no longer points at the subnet the cluster was built with. Either way, the approach is fragile; you should build the ID with the resourceId() function, or just pass it in as a parameter.
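A sketch of the resourceId() form, assuming the subnet name is held in a subnetName variable (that variable is an assumption; substitute whatever the template actually uses):
"agentPoolProfiles": [
    {
        "name": "agentpool",
        "count": "[parameters('agentCount')]",
        "vmSize": "[parameters('agentVMSize')]",
        "osType": "Linux",
        "vnetSubnetID": "[resourceId('Microsoft.Network/virtualNetworks/subnets', variables('vnetName'), variables('subnetName'))]"
    }
]
Unlike reference(), resourceId() is computed purely from the names, so the value stays stable no matter how many subnets the vnet ends up with or how they are ordered.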

DC/OS 1.9 VIP load balancing not working for advertised ports

When I publish a service with a VIP, the advertised address does not route properly to the advertised port. For example, for a MariaDB Galera 3-node cluster service with a VIP specified as:
"labels": {
"VIP_0": "/mariadb-galera:3306"
}
On the configuration tab of the service page (and according to the docs), the load balanced address is:
mariadb-galera.marathon.l4lb.thisdcos.directory:3306
I can ping the DNS name just fine, but...
When I try to connect a front-end service (Drupal 7, WordPress) to consume this load balanced address:port combination, there are numerous connection failures and timeouts. It isn't that it never works, but that it works only sporadically, if at all. Drupal 7 dies almost immediately and starts kicking up Bad Gateway errors.
What I have found through experimentation is that if I specify a hostPort for the service in question, the load balanced address will work as long as I use the hostPort value, and not the advertised load balanced service port as above. In this specific case I specified a hostPort of 3310.
"network":"USER",
"portMappings": [
{
"containerPort": 3306,
"hostPort": 3310,
"servicePort": 10000,
"name": "mariadb-galera",
"labels": {
"VIP_0": "/mariadb-galera:3306"
}
}
Then if I use the load balanced address (mariadb-galera.marathon.l4lb.thisdcos.directory) with the host port value (3310) in my Drupal7 settings.php, the front end connects and works fine.
I've noticed similar behaviour with custom applications connecting to MongoDB backends, also in a DC/OS environment: the load balanced address/port combination specified never works reliably, but if you substitute the hostPort value, it does.
The docs clearly state that:
address and port is load balanced as a pair rather than individually.
(from https://docs.mesosphere.com/1.9/networking/dns-overview/)
Yet I am unable to connect reliably when I specify the VIP-designated port, while it does work when I use the hostPort (and it will not work at all unless I designate a specific hostPort in the service definition JSON). Whether or not this approach is actually load balanced remains a question to me, based on the wording in the documentation.
I must be doing something wrong, but I am at a loss... any help is appreciated.
My cluster nodes are VMWare virtual machines.
The VIP label shouldn't start with a slash:
"container": {
"portMappings": [
{
"containerPort": 3306,
"name": "mariadb-galera",
"labels": {
"VIP_0": "mariadb-galera:3306"
}
}
}
The service should then be available as <VIP label>.marathon.l4lb.thisdcos.directory:<VIP port>, in this case:
mariadb-galera.marathon.l4lb.thisdcos.directory:3306
you can test it using nc:
nc -z -w5 mariadb-galera.marathon.l4lb.thisdcos.directory 3306; echo $?
The command should return 0.
When you're not sure about exported DNS names you can list all of them from any DC/OS node:
curl -s http://198.51.100.1:63053/v1/records | grep mariadb-galera

Azure ML: Getting Error 503: NoMoreResources to any web service API even when I only make 1 request

Getting the following response even when I make one request (concurrency set to 200) to a web service.
{ status: 503, headers: '{"content-length":"174","content-type":"application/json; charset=utf-8","etag":"\"8ce068bf420a485c8096065ea3e4f436\"","server":"Microsoft-HTTPAPI/2.0","x-ms-request-id":"d5c56cdd-644f-48ba-ba2b-6eb444975e4c","date":"Mon, 15 Feb 2016 04:54:01 GMT","connection":"close"}', body: '{"error":{"code":"ServiceUnavailable","message":"Service is temporarily unavailable.","details":[{"code":"NoMoreResources","message":"No resources available for request."}]}}' }
The request-response web service is a recommender retraining web service with a training set containing close to 200k records. The training set is already present in my ML Studio dataset; only 10-15 extra records are passed in the request. The same experiment was working flawlessly until 13 Feb 2016. I have already tried increasing the concurrency, but still the same issue. I even reduced the size of the training set to 20 records, and it still didn't work.
I have two web services, both doing something similar, and neither has been working since 13 Feb 2016.
Finally, I created a really small experiment (skill.csv --> split row --> web output) which doesn't take any input; it just has to return some part of the dataset. It did not work either: response code 503.
The logs I got are as follows
{
    "version": "2014-10-01",
    "diagnostics": [
        .....
        {
            "type": "GetResourceEndEvent",
            "timestamp": 13.1362,
            "resourceId": "5e2d653c2b214e4dad2927210af4a436.865467b9e7c5410e9ebe829abd0050cd.v1-default-111",
            "status": "Failure",
            "error": "The Uri for the target storage location is not specified. Please consider changing the request's location mode."
        },
        {
            "type": "InitializationSummary",
            "time": "2016-02-15T04:46:18.3651714Z",
            "status": "Failure",
            "error": "The Uri for the target storage location is not specified. Please consider changing the request's location mode."
        }
    ]
}
What am I missing? Or am I doing it completely wrong?
Thank you in advance.
PS: Data is stored in MongoDB and then imported as CSV.
This was an Azure problem. To quote the Microsoft engineer:
We believe we have isolated the issue impacting your service and we are currently working on a fix. We will be able to deploy this in the next couple of days. The problem is impacting only the ASIA AzureML region at this time, so if this is an option for you, might I suggest using a workspace in either the US or EU region until the fix gets rolled out here.
To view the complete discussion, click here
