Change number of upgrade domains in Azure Service Fabric

In Azure Service Fabric, the default number of upgrade domains is 5. Is there a way to change it to a different number?
There is a ClusterManifest.xml, but it doesn't seem that we should modify it.

This is not possible in Azure today. SF picks up the FD and UD information from the VM Scale Sets that it runs on, and today these are capped/locked at 5x5. SF itself doesn't care how many UDs you have, and generally recommends more so that during an upgrade you're taking down less of your overall service in terms of capacity and also have more time to react to any issues.
There are some workarounds:
Run multiple Service Fabric application instances. Since each application instance is independently upgradable, you end up with (app instances) * (# of UDs) separate upgrade boundaries.
Run the cluster across VM Scale Sets in multiple Azure AZs. Unfortunately, this only works in regions where Azure exposes multiple AZs.
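The arithmetic behind the first workaround can be sketched as follows. This is only an illustration, not Service Fabric code: the function names and the example of 3 application instances are assumptions made for the example, and the per-step capacity estimate assumes nodes are spread evenly across UDs. Only the figure of 5 UDs comes from the answer above.

```python
def upgrade_boundaries(app_instances: int, upgrade_domains: int) -> int:
    """Separately upgradable units: (app instances) * (# of UDs)."""
    if app_instances < 1 or upgrade_domains < 1:
        raise ValueError("both counts must be at least 1")
    return app_instances * upgrade_domains


def capacity_down_per_step(upgrade_domains: int) -> float:
    """Rough fraction of nodes offline while one UD is upgraded.

    Assumes nodes are spread evenly across UDs (an assumption).
    """
    if upgrade_domains < 1:
        raise ValueError("upgrade_domains must be at least 1")
    return 1 / upgrade_domains


if __name__ == "__main__":
    print(upgrade_boundaries(1, 5))   # one app instance on 5 UDs
    print(upgrade_boundaries(3, 5))   # three app instances on 5 UDs
    print(capacity_down_per_step(5))  # share of nodes down per upgrade step
```

With more boundaries, each upgrade step touches a smaller slice of the overall service, which is the point the answer makes about preferring more UDs.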


What is the recommended minimal instance count for azure app service in production?

What is the minimal recommended instance count for an Azure App Service web app in a production scenario?
We currently run our app services under an Isolated App Service plan with an autoscale rule in place (scaling up to 10 instances).
However, the minimal instance count is currently set to 1.
I was wondering whether this number should be higher (at least 2). Can this impact the SLA and the apps' availability somehow?
If availability is a concern, you should use 2 or 3 instances to minimize impact. Having just one instance can cause downtime. If you have a web app running on 1 instance, the Azure Portal warns you with the following advice:
Distributing your web app across multiple instances
The webapp is currently configured to run on only one instance.
Since you have only one instance you can expect downtime because when the App Service platform is upgraded, the instance on which your web app is running will be upgraded. Therefore, your web app process will be restarted and will experience downtime.
Having 2 instances mitigates this, but can still cause downtime if one instance is unavailable due to a platform upgrade and the other is unavailable due to a new deployment. 3 is best in that regard.
Do note that if high availability is a must, then you also need to think multi-region. Think of a worst-case scenario such as a natural disaster taking offline the one data center that you are using. It would be best to have something like Azure Traffic Manager (ATM) in front of two web apps (each with 3+ instances); if ATM detects that one of your web app regions is offline, it can reroute traffic to minimize downtime. Customers who run storefronts, and whose business depends directly on their app being online, will often take this approach.
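The reasoning about 1, 2, and 3 instances can be sketched as a worst-case count. This is a simplified model, not an App Service API: the function name is hypothetical, and it assumes that at most one instance is down for a platform upgrade and at most one for a deployment at the same moment, as the answer describes.

```python
def instances_left(total: int,
                   down_for_platform_upgrade: int = 1,
                   down_for_deployment: int = 1) -> int:
    """Instances still serving traffic in the worst case described above."""
    return max(total - down_for_platform_upgrade - down_for_deployment, 0)


if __name__ == "__main__":
    for n in (1, 2, 3):
        left = instances_left(n)
        status = "still serving" if left > 0 else "possible downtime"
        print(f"{n} instance(s): {left} left in the worst case, {status}")
```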

Turning off ServiceFabric clusters overnight

We are working on an application that processes Excel files and produces output. Availability is not a big requirement.
Can we turn the VM sets off during night and turn them on again in the morning? Will this kind of setup work with service fabric? If so, is there a way to schedule it?
Thank you all for replying. I got a chance to talk to a Microsoft Azure rep and have documented the conversation here for the community's sake.
Response for initial question
A Service Fabric cluster must maintain a minimum number of nodes in its Primary node type in order for the system services to maintain a quorum and ensure the health of the cluster. The documentation on reliability levels and instance counts has more detail. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. Frequently it is possible to bring the nodes back up and Service Fabric will automatically recover from this quorum loss; however, this is not guaranteed, and the cluster may never be able to recover.
However, if you do not need to save state in your cluster, then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than half an hour, and this can be automated by using PowerShell to deploy an ARM template. The Service Fabric documentation shows how to set up the ARM template and deploy it using PowerShell. You can additionally use a fixed domain name or static IP address so that clients don’t have to be reconfigured to connect to the cluster. If you need to keep other resources such as the storage account, then you could also configure the ARM template to only delete the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
Q) Is there a better way to stop/start the VMs rather than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
Q) Can we do a primary set with cheapest VMs we can find and add a secondary set with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a ‘Worker’ that is a larger size – and set placement constraints on your application so that it only deploys to the larger VMs. However, if your Service Fabric service is storing state, then you will still run into a similar problem: once your worker VMs lose quorum (fewer than 3 replicas/nodes), there is no guarantee that your SF service itself will come back with all of its state maintained. In this case your cluster itself would still be fine, since the Primary nodes are running, but your service’s state may be in an unknown replication state.
I think you have a few options:
Instead of storing state within Service Fabric’s reliable collections, store your state externally in something like Azure Storage or SQL Azure. You can optionally use something like Redis cache or Service Fabric’s reliable collections to maintain a faster read cache; just make sure all writes are persisted to an external store. This way you can freely delete and recreate your cluster at any time you want.
Use the Service Fabric backup/restore in order to maintain your state, and delete the entire resource group or cluster overnight and then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup.
Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch which offers native capabilities to quickly burst up compute capacity.
No. You would have to delete the cluster, then recreate it and redeploy the application in the morning.
Turning off the cluster is, as Todd said, not an option. However, you can scale down the number of VMs in the cluster.
During the day you would run the number of VMs required. At night you can scale down to the minimum of 5. See the documentation on how to scale VM scale sets.
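The day/night scaling idea above could be driven by a small scheduling function like the one below. This is a sketch under stated assumptions: the function name, the 07:00 to 19:00 working window, and the example daytime count of 12 are all hypothetical. Only the night-time minimum of 5 comes from the answer. Actually applying the count to the scale set is left out, since that depends on your tooling.

```python
from datetime import time

NIGHT_MINIMUM = 5  # minimum VM count quoted in the answer above


def desired_vm_count(now: time,
                     daytime_count: int,
                     day_start: time = time(7, 0),   # assumed start of the working day
                     day_end: time = time(19, 0)) -> int:  # assumed end of the working day
    """Return how many VMs the scale set should run at the given time of day."""
    if daytime_count < NIGHT_MINIMUM:
        raise ValueError("daytime_count must not be below the night minimum")
    is_daytime = day_start <= now < day_end
    return daytime_count if is_daytime else NIGHT_MINIMUM


if __name__ == "__main__":
    print(desired_vm_count(time(12, 0), daytime_count=12))  # midday
    print(desired_vm_count(time(23, 0), daytime_count=12))  # late evening
```

A scheduled job could call this periodically and adjust the scale set capacity only when the result differs from the current count.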
For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to start and stop SF clusters on Azure by starting and stopping the VM scale sets associated with these clusters. But upon restart all your applications (and with them their state) are gone and must be redeployed.

Azure Availability Set vs Affinity Group

I'm a little bit confused about when to use an Azure Availability Set and when to use an Azure Affinity Group.
Let's begin by looking briefly at the key purpose of each.
Availability Set: predominantly there to provide high availability for your deployment. Azure does this via fault domains and upgrade domains.
Fault domain: basically a different hardware rack in the same datacenter. The solution will be deployed in two different hardware racks.
Upgrade domain: works much like a fault domain, but covers upgrades rather than failures. The upgrade domain is a logical unit of instance separation that determines which instances in a particular service will be upgraded at a point in time.
Affinity Group: to explain it, we need to take a peek into an Azure DC. Windows Azure data centers are purpose-built: you might see rows and rows of containers (something like shipping containers) that contain clusters and racks. Each of those containers hosts specific services, for example, Compute and Storage, SQL Azure, Service Bus, Access Control Service, and so on. Those containers are spread across the data center.
When you deploy a service using the Portal or PowerShell, the service talks directly to the RDFE (Red Dog Front End). The RDFE controls the DC and its nodes. Each cluster of nodes is controlled by a Fabric Controller. When you specify an Affinity Group, the Fabric Controller will place all the required elements of a deployment together. This has a number of advantages, such as reduced latency (since the required elements are close together) and networking benefits.
There have also been recent changes related to network affinity groups that are worth reading up on.
To address your question:
You would use an Availability Set when you want a highly available system and also want an SLA for compute. Without an Availability Set there won't be an SLA for your VM or PaaS instances; in other words, single instances of a VM (IaaS) or PaaS role have no SLA and are prone to downtime during hardware failures and OS upgrades.
An Availability Set can be implemented after the deployment as well. Do note that there is a cost associated with an Availability Set: since you are running additional instances, they will be charged.
An Affinity Group must be specified when the service is created; it cannot be changed afterwards, so it is very important to include it at creation time. There are no additional charges for using an Affinity Group.
Do share your feedback if the response addresses your question.

How to autoscale virtual machines (IaaS approach) in Azure

How can I autoscale virtual machines (the IaaS approach) in Azure, as opposed to web/worker role autoscaling?
You can now Autoscale Virtual machines in Azure directly in the Azure Management Portal. ScottGu has a post about it on his blog.
The important thing when autoscaling VMs is that you must proactively provision the maximum number of VMs you think you'll need to handle your peak capacity, and add them all to the same availability set.
For example, if on the busiest day of the week it takes 6 machines to handle all of your traffic, then you need to create 6 instances, install your application on each one, configure each to handle traffic, and add all 6 to the same availability set.
Once you've done this, you can navigate to the Cloud Service that contains all of your virtual machines and click on the Scale tab. You should see a list of your availability sets, and it should tell you the number of machines you can scale over. Choose a metric (either CPU or Queue today), and then the range of machines you want to scale between. You can scale between 1 and the total number of machines.
When load is low, Azure will turn off machines (so you don't have to pay for them), and when load is high, Azure will turn those machines back on.
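The scaling behaviour described above can be pictured as a clamped decision on a metric. The sketch below is not Azure's actual algorithm, which the answer does not describe: the function name, the one-machine step, and the 80% / 60% CPU thresholds are assumptions made for illustration.

```python
def target_running_machines(current: int,
                            cpu_percent: float,
                            minimum: int,
                            maximum: int,
                            scale_up_above: float = 80.0,     # assumed threshold
                            scale_down_below: float = 60.0) -> int:  # assumed threshold
    """Pick how many pre-provisioned machines should be running."""
    if not 1 <= minimum <= maximum:
        raise ValueError("need 1 <= minimum <= maximum")
    if cpu_percent > scale_up_above:
        wanted = current + 1   # load is high: turn one more machine on
    elif cpu_percent < scale_down_below:
        wanted = current - 1   # load is low: turn one machine off
    else:
        wanted = current       # within the target band: no change
    # Never go outside the configured range of pre-provisioned machines.
    return max(minimum, min(maximum, wanted))


if __name__ == "__main__":
    print(target_running_machines(current=3, cpu_percent=90, minimum=1, maximum=6))
    print(target_running_machines(current=6, cpu_percent=95, minimum=1, maximum=6))
    print(target_running_machines(current=1, cpu_percent=10, minimum=1, maximum=6))
```

The upper clamp reflects the point made above: the platform can only turn on machines you have already provisioned in the availability set.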
Auto-scaling at the IaaS level doesn't really make sense. Even if Azure could detect high CPU usage and start a new VM based on it, what then? You still need to install your application on that VM automatically somehow.
What you are looking for is something that runs your app on Azure and installs new instances on new VMs if necessary. That "something" is called a PaaS enabler. Basically, it is another abstraction level between your app and the Azure IaaS.
There are a couple of them out there:
Cloudify, Cloud Foundry, Juju
As far as I know, the only one that supports Azure is Cloudify. You can check out how to configure Azure using Cloudify in its "Configuring Azure" guide.
You can also check out the community on the Cloudify Forum, or post questions here for assistance.
Disclaimer: I work for Gigaspaces, developing the Cloudify product line.
According to this, it's possible to scale out IaaS with availability sets by pre-provisioning the number of boxes.

How Windows Azure Platform scales instances and balances workload?

The Windows Azure Platform allows an application to be deployed to one or more instances. The fabric controller then balances your application's workload across those instances.
Can the number of instances be scaled up/down based on demand or are the number of instances static? If instances can be dynamically started how much control do I have over how this happens?
How does Azure balance workload amongst my application instances and do I have any control over how this happens?
I just want to add that by commercial launch (November), we'll have an API that lets you programmatically modify the number of instances. (So you can scale based on whatever logic you want.)
This question has lots of good information, including a 3rd party tool (AzureWatch) that I use that can scale up/down based on load: "Azure platform: scalling instances up and down".
The number of instances for Azure roles is specified in an XML configuration file. Currently, you must manually change the instance count in this config file. When you do so, the fabric controller will automatically adjust the number of running instances for you.
For web roles, incoming TCP connections are balanced across your instances. For worker roles, the load is generally distributed across all instances picking up work assignments from a message queue. The fabric doesn't really get involved for worker roles.
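The two distribution patterns above can be sketched in a few lines. This is a toy model, not Azure code: the answer does not say which algorithm the load balancer uses, so round-robin is an assumption, and real workers pull from the queue concurrently rather than taking strict turns. All names (`web_0`, `worker_0`, and so on) are hypothetical.

```python
from collections import deque
from itertools import cycle


def balance_connections(connections, instances):
    """Spread incoming connections over web role instances (round-robin assumed)."""
    assignment = {name: [] for name in instances}
    for conn, name in zip(connections, cycle(instances)):
        assignment[name].append(conn)
    return assignment


def drain_queue(messages, workers):
    """Worker instances take turns pulling the next message from a shared queue."""
    queue = deque(messages)
    handled = {name: [] for name in workers}
    for name in cycle(workers):
        if not queue:
            break
        handled[name].append(queue.popleft())
    return handled


if __name__ == "__main__":
    print(balance_connections(["c1", "c2", "c3", "c4", "c5"], ["web_0", "web_1"]))
    print(drain_queue(["m1", "m2", "m3"], ["worker_0", "worker_1"]))
```

The difference worth noticing is who drives the distribution: for web roles something in front of the instances assigns the work, while for worker roles each instance pulls work for itself.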
I know this is an old question, but I thought I'd highlight the free Windows Autoscaling Application Block, which has been released since the question was first asked.
