I have a Service Fabric cluster hosted in Microsoft Azure, and I have configured its scale set to register all nodes with Azure Automation DSC (following the example from https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/dsc-template#template-example-for-windows-virtual-machine-scale-sets).
I now need to update the DSC script to also ensure that TLS 1.0 is disabled. This registry change requires a reboot of the affected machines. How can I get DSC to apply this change one update domain at a time so that all the VMs in my cluster aren't rebooted at the same time?
This depends on the durability level that you have configured for your cluster:
- Gold: Restarts can be delayed until approved by the Service Fabric cluster. Updates can be paused for 2 hours per UD to allow additional time for replicas to recover from earlier failures.
- Silver: Restarts can be delayed until approved by the Service Fabric cluster. Updates cannot be delayed for any significant period of time.
- Bronze: Restarts will not be delayed by the Service Fabric cluster. Updates cannot be delayed for any significant period of time.
So, for the reboots to roll out one update domain at a time, your cluster needs either the Silver or Gold durability level.
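For reference, the registry change itself could be expressed with the built-in Registry DSC resource, roughly as in the sketch below. The configuration name is a placeholder, the matching Client key can be added the same way, and the LCM still has to be allowed to reboot the node (e.g. RebootNodeIfNeeded = $true in the LCM settings):

    Configuration DisableTls10
    {
        Import-DscResource -ModuleName 'PSDesiredStateConfiguration'

        Node 'localhost'
        {
            # Turn TLS 1.0 off for the server side of SCHANNEL
            Registry Tls10ServerEnabled
            {
                Key       = 'HKLM:\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.0\Server'
                ValueName = 'Enabled'
                ValueType = 'Dword'
                ValueData = '0'
                Ensure    = 'Present'
            }

            # Also mark the protocol as disabled by default
            Registry Tls10ServerDisabledByDefault
            {
                Key       = 'HKLM:\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.0\Server'
                ValueName = 'DisabledByDefault'
                ValueType = 'Dword'
                ValueData = '1'
                Ensure    = 'Present'
            }
        }
    }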
I want to upgrade my AKS cluster using Terraform with no (or minimal) downtime.
What happens to the workloads during the cluster upgrade?
Can I do the AKS cluster upgrade and the node upgrade at the same time?
Azure provides Scheduled AKS cluster maintenance (a preview feature); does that mean Azure does the cluster upgrade?
You have several questions listed here, so I will try to answer them as best I can. Most of them are generic and not specific to Terraform, so I will address Terraform separately at the bottom.
What happens to the workloads during the cluster upgrade?
During an upgrade, it depends on whether Azure is doing the upgrade or you are doing it manually. If Azure does the upgrade, it may be disruptive, depending on the settings you chose when you created the cluster.
If you do the upgrade yourself, you can do it with no downtime, but it does require some Azure CLI usage due to how the azurerm_kubernetes_cluster Terraform resource is designed.
Can I do the AKS cluster upgrade and the node upgrade at the same time?
Yes. If your node is out of date and you schedule a cluster upgrade, the nodes will be brought up to date in the process of upgrading the cluster.
Azure provides Scheduled AKS cluster maintenance (a preview feature); does that mean Azure does the cluster upgrade?
No. A different setting determines if Azure does the upgrade. This Scheduled Maintenance feature is designed to allow you to specify what times and days Microsoft is NOT allowed to do maintenance. The default when you don't specify a Scheduled Maintenance is that Microsoft may perform upgrades at any time:
https://learn.microsoft.com/en-us/azure/aks/planned-maintenance
Your AKS cluster has regular maintenance performed on it automatically. By default, this work can happen at any time. Planned Maintenance allows you to schedule weekly maintenance windows that will update your control plane as well as your kube-system Pods on a VMSS instance and minimize workload impact
The feature you are looking for regarding AKS performing cluster upgrades is called Cluster Autoupgrade, and you can read about that here: https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster#set-auto-upgrade-channel-preview
Now, regarding performing a cluster upgrade with Terraform: currently, due to how azurerm_kubernetes_cluster is designed, it is not possible to perform an upgrade of a cluster using only Terraform; some azure-cli usage is required. It is possible to perform a cluster upgrade without downtime, but not by exclusively using Terraform. The steps to perform such an upgrade are detailed pretty well in this blog post: https://blog.gft.com/pl/2020/08/26/zero-downtime-migration-of-azure-kubernetes-clusters-managed-by-terraform/
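To give a sense of the shape of that approach (this is not the blog's exact procedure, and the resource group, cluster, pool names and version here are made up): you add a new node pool at the target Kubernetes version, drain the old pool so workloads reschedule, then remove it:

    # Add a node pool running the target Kubernetes version
    az aks nodepool add --resource-group myRG --cluster-name myAKS --name newpool --node-count 3 --kubernetes-version 1.21.2

    # Cordon and drain the old pool so workloads reschedule onto the new one
    kubectl cordon -l agentpool=oldpool
    kubectl drain -l agentpool=oldpool --ignore-daemonsets --delete-emptydir-data

    # Remove the old pool once it is empty
    az aks nodepool delete --resource-group myRG --cluster-name myAKS --name oldpool

Afterwards you still have to reconcile the Terraform configuration and state with the new pool, which is the part the blog post walks through.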
AKS uses the concept of a buffer node when an upgrade is performed: it brings up a buffer node, moves the workload over to it, and then upgrades the actual node. The time taken to upgrade the cluster depends on the number of nodes in it.
https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster#upgrade-an-aks-cluster
You can upgrade both the control plane and the nodes using the Azure CLI:
az aks upgrade --resource-group <ResourceGroup> --name <ClusterName> -k <KubernetesVersion>
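If you want to stage the two parts separately, the CLI can also upgrade just the control plane first and the node pools afterwards (same placeholders as above; the node pool name is whatever you named your pool):

    az aks upgrade --resource-group <ResourceGroup> --name <ClusterName> -k <KubernetesVersion> --control-plane-only
    az aks nodepool upgrade --resource-group <ResourceGroup> --cluster-name <ClusterName> --name <NodePoolName> -k <KubernetesVersion>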
We are using Azure Service Fabric (a stateless service) which gets messages from an Azure Service Bus queue and processes them. The tasks generally take between 5 minutes and 5 hours.
When it's busy we want to scale out servers, and when it gets quiet we want to scale back in again.
How do we scale in without interrupting long running tasks? Is there a way we can tell Service Fabric which server is free to scale in?
Azure Monitor Custom Metric
1. Integrate your SF service with EventFlow. For instance, make it send logs into Application Insights.
2. While a task is being processed, send logs indicating that it's in progress.
3. Configure a custom metric in Azure Monitor to scale in only in the absence of logs indicating that a machine has in-progress tasks.
The trade-off here is that you have to wait for all in-progress tasks to finish before a scale-in can happen.
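As a rough sketch of step 3, assuming the logs surface as a custom Application Insights metric called TasksInProgress and using the classic Az.Monitor autoscale cmdlets (all resource IDs and names below are placeholders):

    # Scale in by one VM only when no machine has reported an in-progress task for 30 minutes
    $appInsightsId = '/subscriptions/<sub>/resourceGroups/<rg>/providers/microsoft.insights/components/<appinsights>'
    $vmssId = '/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>'

    $scaleIn = New-AzAutoscaleRule -MetricName 'TasksInProgress' -MetricResourceId $appInsightsId `
        -Operator LessThanOrEqual -MetricStatistic Average -Threshold 0 `
        -TimeGrain 00:01:00 -TimeWindow 00:30:00 `
        -ScaleActionCooldown 00:10:00 -ScaleActionDirection Decrease -ScaleActionValue 1

    $autoProfile = New-AzAutoscaleProfile -Name 'scale-in-when-idle' -Rule $scaleIn `
        -DefaultCapacity 5 -MinimumCapacity 5 -MaximumCapacity 20

    Add-AzAutoscaleSetting -Name 'sf-worker-autoscale' -ResourceGroupName '<rg>' `
        -Location 'westeurope' -TargetResourceId $vmssId -AutoscaleProfile $autoProfile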
There is a good article that explains how to Scale a Service Fabric cluster programmatically
Here is another approach which requires a bit of coding - Automate manual scaling
Develop another service, either as part of the SF application or as a VM extension. The point here is to have the service run on all the nodes in the cluster and track the status of task execution.
There are well-defined steps for manually excluding an SF node from the cluster:
1. Run Disable-ServiceFabricNode with intent ‘RemoveNode’ to disable the node you’re going to remove (the highest instance in that node type).
2. Run Get-ServiceFabricNode to make sure that the node has indeed transitioned to Disabled. If not, wait until the node is disabled. You cannot hurry this step.
3. Follow the sample/instructions in the quick start template gallery to change the number of VMs by one in that node type. The instance removed is the highest VM instance.
And so forth... You can find more info in Scale a Service Fabric cluster in or out using auto-scale rules. The takeaway here is that these steps can be automated, as in the sketch below.
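A rough PowerShell sketch of those steps, assuming you are already connected to the cluster (Connect-ServiceFabricCluster) and have the Az module available; the node, resource group and scale set names are placeholders:

    # Step 1: disable the highest node instance with intent RemoveNode
    $nodeName = '_NodeType1_4'
    Disable-ServiceFabricNode -NodeName $nodeName -Intent RemoveNode -Force

    # Step 2: wait until the node has actually transitioned to Disabled (cannot be hurried)
    while ((Get-ServiceFabricNode -NodeName $nodeName).NodeStatus -ne 'Disabled') {
        Start-Sleep -Seconds 30
    }

    # Step 3: drop the scale set capacity by one; the highest instance is the one removed
    $vmss = Get-AzVmss -ResourceGroupName 'sf-rg' -VMScaleSetName 'NodeType1'
    $vmss.Sku.Capacity = $vmss.Sku.Capacity - 1
    Update-AzVmss -ResourceGroupName 'sf-rg' -VMScaleSetName 'NodeType1' -VirtualMachineScaleSet $vmss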
Implement the scaling logic in the new service: monitor which nodes have finished their tasks and are sitting idle, and scale those in using the steps described above.
Hopefully it makes sense.
Thanks a lot to @tank104 for the help elaborating my answer!
We are working on an application that processes Excel files and spits out output. Availability is not a big requirement.
Can we turn the VM scale sets off during the night and turn them on again in the morning? Will this kind of setup work with Service Fabric? If so, is there a way to schedule it?
Thank you all for replying. I got a chance to talk to a Microsoft Azure rep and documented the conversation here for the community's sake.
Response for initial question
A Service Fabric cluster must maintain a minimum number of nodes in the Primary node type in order for the system services to maintain a quorum and ensure the health of the cluster. You can see more about the reliability level and instance count at https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-capacity/. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. Frequently it is possible to bring the nodes back up and Service Fabric will automatically recover from this quorum loss, but this is not guaranteed and the cluster may never be able to recover.
However, if you do not need to save state in your cluster then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than a half hour, and this can be automated by using Powershell to deploy an ARM template. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-creation-via-arm/ shows how to setup the ARM template and deploy using Powershell. You can additionally use a fixed domain name or static IP address so that clients don’t have to be reconfigured to connect to the cluster. If you have need to maintain other resources such as the storage account then you could also configure the ARM template to only delete the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
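A minimal sketch of that daily recreate flow with the current Az PowerShell module (the resource group name, location, and template paths are placeholders):

    # Evening: tear down the whole resource group
    Remove-AzResourceGroup -Name 'sf-cluster-rg' -Force

    # Morning: recreate the cluster from the same ARM template
    New-AzResourceGroup -Name 'sf-cluster-rg' -Location 'westus'
    New-AzResourceGroupDeployment -ResourceGroupName 'sf-cluster-rg' `
        -TemplateFile '.\servicefabric-cluster.json' `
        -TemplateParameterFile '.\servicefabric-cluster.parameters.json'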
Q) Is there a better way to stop/start the VMs rather than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
Q) Can we do a primary set with the cheapest VMs we can find and add a secondary set with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a ‘Worker’ that is a larger size – and set placement constraints on your application to only deploy to those larger-size VMs. However, if your Service Fabric service is storing state then you will still run into a similar problem: once you lose quorum (below 3 replicas/nodes) on your worker VMs, there is no guarantee that your SF service will come back with all of its state maintained. In this case your cluster itself would still be fine since the Primary nodes are running, but your service’s state may be in an unknown replication state.
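For the placement constraints piece, a service can be pinned to the larger node type at creation time. A sketch with placeholder application and service names (NodeTypeName is the built-in node property that matches the node type):

    # Pin the stateless worker service to the 'Worker' node type only
    New-ServiceFabricService -Stateless `
        -ApplicationName 'fabric:/MyApp' `
        -ServiceName 'fabric:/MyApp/Processor' `
        -ServiceTypeName 'ProcessorType' `
        -PartitionSchemeSingleton `
        -InstanceCount -1 `
        -PlacementConstraint 'NodeTypeName == Worker'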
I think you have a few options:
Instead of storing state within Service Fabric’s reliable collections, store your state externally in something like Azure Storage or SQL Azure. You can optionally use something like Redis Cache or Service Fabric’s reliable collections to maintain a faster read-cache, just make sure all writes are persisted to an external store. This way you can freely delete and recreate your cluster at any time you want.
Use the Service Fabric backup/restore in order to maintain your state, and delete the entire resource group or cluster overnight and then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup.
Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch which offers native capabilities to quickly burst up compute capacity.
No. You would have to delete the cluster, then recreate it and deploy the application in the morning.
Turning off the cluster is, as Todd said, not an option. However, you can scale down the number of VMs in the cluster.
During the day you would run the number of VMs required. At night you can scale down to the minimum of 5. Check this page on how to scale VM scale sets: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-scale-up-down/
For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to start and stop SF clusters on Azure by starting and stopping the VM scale sets associated with these clusters. But upon restart all your applications (and with them their state) are gone and must be redeployed.
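For completeness, the start/stop itself is just the scale-set operations, which could be dropped into a scheduled Azure Automation runbook; the resource group and scale set names below are placeholders:

    # Evening: deallocate every VM in the scale set (compute charges stop)
    Stop-AzVmss -ResourceGroupName 'sf-rg' -VMScaleSetName 'nt1vm' -Force

    # Morning: bring the scale set back up, then redeploy the application
    Start-AzVmss -ResourceGroupName 'sf-rg' -VMScaleSetName 'nt1vm'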
Once I create the Azure Service Fabric cluster through the Azure portal, I am not sure how long I am supposed to wait for the cluster to be up and running. I am only using a bare-minimum configuration (node type count of 1, Bronze reliability, 3 VMs, etc.). Will it take an hour or two, or more, or less? Will there be some kind of indication that the cluster deployment is done and it is available for me to publish code to from Visual Studio? Also, I am not seeing any nodes in the provisioned cluster in the portal.
Thanks.
Raghu/..
Per mckjerral's suggestion, I changed my VM size from A1 Standard to DS1 Standard and also changed the reliability tier from Bronze to Silver; it then deployed successfully and I was able to publish my Service Fabric app to it. Thank you for your help.
Raghu/..
Microsoft just sent out an email notifying our company that there will be scheduled maintenance for our Windows Azure environment.
We will be performing maintenance on our networking hardware. We are scheduling the update to occur during nonbusiness hours as much as possible, in each maintenance region. Single and multi-instance Virtual Machines and Cloud Services deployments will reboot once during this maintenance operation. Each instance reboot should last 30 to 45 minutes.
We suggest using availability sets in the architecture to protect against downtime caused by planned maintenance. This maintenance will proceed by updating instances in only one Fault Domain (FD) at a time for the Cloud Services and Virtual Machines in an Availability Set.
Now our website consists of a Cloud Service with 8 (small) instances of a web role. With these 8 instances, is there still a possibility of downtime for the website? Do we need to use 'Availability Sets' or are we safe? Thanks for any info.
Depends on which service you're referring to. From my understanding, because you mentioned "Web Role", you're talking about Cloud Services (PaaS).
In General:
If you have Cloud Services (PaaS), which is what you have based on my understanding, then you won't have any downtime: your 8 web role instances are automatically spread across multiple fault domains, and the maintenance only reboots one Fault Domain at a time, so a subset of your instances stays up throughout.
If you have VMs (Virtual Machines) that don't belong to the same Availability Set, then there is a chance of downtime. To fix that, just make sure they are in the same Availability Set. If you don't have VMs, ignore this.
Hope it helps.