We run Azure Service Fabric in production on a Silver-tier cluster. Our issue is that when a single instance spikes because of high CPU and memory utilization, the load balancer does not route requests to the other nodes. That single node reaches 90 percent utilization, and we cannot even RDP into it while the spike lasts. I have read Microsoft's articles about adding placement constraints and tried them, but that did not work either. We cannot apply rules to the load balancer because we have integrated APIM with Service Fabric. I have had multiple calls with Microsoft and still have no solution that works. I need a solution to this problem.
I know one of our services has an issue, and we are already working on it, but we need SF to handle this scenario as well.
If one or more of your services generates CPU / memory spikes (and not a consistent high utilization) then it will be very hard to balance such behavior.
Anyway, you can do two things to mitigate it:
Use resource governance to restrict the amount of CPU and memory that this problematic service can consume
Microsoft released FabricObserver, which can be used to extend the monitoring of your SF cluster. Have a look at how you can leverage AppObserver to report the CPU and memory usage of a single service (process) as LoadMetrics and use them to balance the cluster.
Related
I have some workload which needs to be run a few times per week. It requires some heavy computational work and runs about one hour (with 16 cores and 32gb memory). It is possible to run it in a container.
Azure offers many different possibilities to run containers. (I have no knowledge of most of the Azure services, so my conclusions might be wrong.) Firstly, I thought Azure Container Instances is perfect for this scenario, but it only offers containers with up to 4 vCPU and 16gb memory. There is no need for orchestration with a single container, so Azure Kubernetes Service and Azure Service Fabric come with too much overhead. Similarly, Azure Batch also offers computational clusters which are not needed for a single workload.
Which Azure service is the best fit for this use case?
A "best fit" question is likely to be closed, but here's a suggestion anyway.
Don't dismiss AKS. You can easily create a 1 node cluster using a VM that fits your required configuration. Using the standard SLA, you don't pay for the master node, and you can stop your cluster after each run so that you are no longer charged. There is no need to bother about orchestration: see this as a VM that has everything needed to run your container, and use it as you would an ACI.
I am running an optimization model (using Google.OrTools) that I built in .NET Framework. When I ran it locally, the application used more than 99% CPU, so my team decided to move it to an Azure scale set where I have one VM, configured to scale out to as many as 10 VMs. The problem is that I still see >99% CPU only on my main VM; even though new VMs have been added (scaled out), the CPU on those VMs is <1%. I am now confused about how scale sets work in Azure.
In this case, I suspect the job is not being shared with the other VMs. How can I resolve this?
Please note that I run my application as a console application. The job does not connect frequently to a database or to a drive; it is a purely mathematical problem.
Customers typically use an Azure VMSS as a front end (or backend pool).
The autoscale capability of Azure VMSS reduces the management overhead of monitoring and tuning your scale set as customer demand changes over time.
Azure VMSS uses Azure Load Balancer to route traffic to all VMSS instances, which keeps CPU usage roughly consistent across instances.
If your service reaches 99% CPU without serving any incoming requests or connections, you should resize that VM to a larger size.
First, your preferences and your budget don't determine whether your workload can scale out rather than scale up.
An Azure scale set includes some backend VMs and a load balancer. The load balancer distributes requests to the backend servers.
Your workload can take advantage of an Azure scale set if it consists of multiple, independent requests. The canonical example of this kind of workload is a web server. Running this kind of workload on an Azure scale set doesn't usually require any changes to code.
You might be able to run your workload on a scale set if you have a single request that can be broken down into smaller pieces that can be processed independently. For this kind of parallel processing to work, you'd probably have to rewrite some of your code. The load balancer would see these smaller pieces as multiple requests.
Other ways to improve mathematical performance include
using a different, more appropriate language,
running your code on a GPU rather than a CPU, or
leveraging a third-party system, like Wolfram Mathematica.
I'm sure there are other ways.
Imagine you have 10 physical machines in the lab. How would you split up this task to run faster, on all the machines?
A scale set is a collection of VMs. To make use of scale sets, and autoscale, your compute intensive job needs to be parallelizable. For example, if you can split it into many sub-tasks, then each VM in the scale set can request a sub-task, compute it, send the result somewhere for aggregation, and request another task.
Here is an example of a compute intensive task running on 1000 VMs in a scale set: https://techcommunity.microsoft.com/t5/Microsoft-Ignite-Content-2017/The-journey-to-provision-and-manage-a-thousand-VM-application/td-p/99113
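To make the "request a sub-task, compute it, send the result for aggregation, request another" idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: threads stand in for the VMs, an in-process `queue.Queue` stands in for whatever shared queue you would really use, and `solve_subtask` is a hypothetical placeholder (sum of squares) for one independent slice of the real optimization work.

```python
import queue
import threading


def split_into_subtasks(start, stop, chunk_size):
    """Split the half-open range [start, stop) into independent chunks."""
    return [(lo, min(lo + chunk_size, stop)) for lo in range(start, stop, chunk_size)]


def solve_subtask(bounds):
    """Hypothetical stand-in for the real computation on one chunk."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))


def worker(tasks, results):
    """One 'VM': pull a sub-task, compute it, report the result, repeat."""
    while True:
        try:
            bounds = tasks.get_nowait()
        except queue.Empty:
            return  # no work left, so this worker stops
        results.put(solve_subtask(bounds))


def run_job(start, stop, chunk_size, workers=4):
    """Queue all sub-tasks, let the workers drain the queue, then aggregate."""
    tasks, results = queue.Queue(), queue.Queue()
    for bounds in split_into_subtasks(start, stop, chunk_size):
        tasks.put(bounds)
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = 0
    while not results.empty():  # safe here: all producers have finished
        total += results.get()
    return total


if __name__ == "__main__":
    print(run_job(0, 1_000_000, 50_000))
```

The point of the sketch is the shape of the code, not the speed: adding workers only helps because each chunk can be solved without knowing anything about the others. If your model cannot be cut up like this, extra VMs in the scale set will sit idle, which matches the <1% CPU you are seeing.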
I have a Linux Standard B2ms Azure virtual machine. I have disabled the auto-shutdown feature found in the dashboard under Operations. For some reason the server was still shut down after running for about 8 days.
What could have shut down this server, given that I haven't changed anything on it in the last three days?
Many things can shut down this VM, so we should start by looking for logs.
First, check Azure Alerts in the Azure portal for any entries about your VM.
Second, check the VM's performance, such as high CPU or high memory usage; the relevant logs are under /var/log/.
You can also check whether there was an issue with the Azure service itself: go to Service Health -> Health history to see whether there were any issues in your region.
By the way, a single VM in Azure is always a single point of failure. Microsoft recommends creating two or more VMs within an availability set to provide a highly available application and to meet the 99.95% Azure SLA.
An availability set is composed of two additional groupings that protect against hardware failures and allow updates to safely be applied - fault domains (FDs) and update domains (UDs).
Fault domains:
A fault domain is a logical group of underlying hardware that share a common power source and network switch, similar to a rack within an on-premises datacenter. As you create VMs within an availability set, the Azure platform automatically distributes your VMs across these fault domains. This approach limits the impact of potential physical hardware failures, network outages, or power interruptions.
Update domains:
An update domain is a logical group of underlying hardware that can undergo maintenance or be rebooted at the same time. As you create VMs within an availability set, the Azure platform automatically distributes your VMs across these update domains. This approach ensures that at least one instance of your application always remains running as the Azure platform undergoes periodic maintenance. The order of update domains being rebooted may not proceed sequentially during planned maintenance, but only one update domain is rebooted at a time.
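As a rough illustration of why two VMs in an availability set survive the reboot of one update domain, here is a small Python sketch. The round-robin assignment and the default counts (2 fault domains, 5 update domains) are assumptions made for the example, not a statement of how the platform actually places VMs.

```python
def place_vms(vm_count, fault_domains=2, update_domains=5):
    """Assign each VM to a (fault domain, update domain) pair, round-robin.

    Round-robin placement and the default counts are illustrative assumptions.
    """
    return {
        f"vm{i}": (i % fault_domains, i % update_domains)
        for i in range(vm_count)
    }


def survivors_when_ud_reboots(placement, rebooting_ud):
    """Return the VMs that stay up while one update domain is rebooted."""
    return sorted(vm for vm, (_, ud) in placement.items() if ud != rebooting_ud)


if __name__ == "__main__":
    placement = place_vms(2)
    for ud in range(5):
        print(ud, survivors_when_ud_reboots(placement, ud))
```

With a single VM, rebooting its update domain leaves nothing running; with two VMs in different update domains, at least one is always up while only one domain is rebooted at a time.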
In your scenario, there may have been an unplanned maintenance event. When Microsoft updates the VM host, it migrates your VM to another host, and the VM is shut down before it is migrated.
To achieve high availability, create at least two VMs in one availability set.
We are working on an application that processes Excel files and produces output. Availability is not a big requirement.
Can we turn the VM sets off during night and turn them on again in the morning? Will this kind of setup work with service fabric? If so, is there a way to schedule it?
Thank you all for replying. I got a chance to talk to a Microsoft Azure rep and have documented the conversation here for the community's sake.
Response for initial question
A Service Fabric cluster must maintain a minimum number of Primary node types in order for the system services to maintain a quorum and ensure health of the cluster. You can see more about the reliability level and instance count at https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-capacity/. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. Frequently it is possible to bring the nodes back up and Service Fabric will automatically recover from this quorum loss, however this is not guaranteed and the cluster may never be able to recover.
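To illustrate the quorum point, here is a minimal Python sketch of a simple majority rule. The replica count of 5 used in the example is an assumption for illustration, and the function names are hypothetical; this is not a Service Fabric API.

```python
def quorum_size(replica_count):
    """Majority of the replica set: more than half of the replicas."""
    return replica_count // 2 + 1


def has_quorum(replica_count, replicas_up):
    """True if enough replicas are up to form a majority."""
    return replicas_up >= quorum_size(replica_count)


if __name__ == "__main__":
    replicas = 5  # assumed replica set size, for illustration only
    for up in range(replicas, -1, -1):
        print(f"{up} of {replicas} up -> quorum: {has_quorum(replicas, up)}")
```

Under this rule a 5-replica set tolerates two replicas being down, and stopping every VM takes it to zero, which is well past the point of quorum loss.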
However, if you do not need to save state in your cluster then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than a half hour, and this can be automated by using PowerShell to deploy an ARM template. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-creation-via-arm/ shows how to set up the ARM template and deploy using PowerShell. You can additionally use a fixed domain name or static IP address so that clients don’t have to be reconfigured to connect to the cluster. If you need to maintain other resources such as the storage account then you could also configure the ARM template to only delete the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
Q) Is there a better way to stop/start the VMs rather than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
Q) Can we do a primary set with cheapest VMs we can find and add a secondary set with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a ‘Worker’ that is a larger size – and set placement constraints on your application to only deploy to those larger size VMs. However, if your Service Fabric service is storing state then you will still run into a similar problem that once you lose quorum (below 3 replicas/nodes) of your worker VM then there is no guarantee that your SF service itself will come back with all of the state maintained. In this case your cluster itself would still be fine since the Primary nodes are running, but your service’s state may be in an unknown replication state.
I think you have a few options:
Instead of storing state within Service Fabric’s reliable collections, store your state externally in something like Azure Storage or SQL Azure. You can optionally use something like Redis cache or Service Fabric’s reliable collections to maintain a faster read-cache; just make sure all writes are persisted to an external store. This way you can freely delete and recreate your cluster at any time you want.
Use the Service Fabric backup/restore in order to maintain your state, and delete the entire resource group or cluster overnight and then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup.
Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch which offers native capabilities to quickly burst up compute capacity.
No. You would have to delete the cluster and recreate the cluster and deploy the application in the morning.
Turning off the cluster is, as Todd said, not an option. However, you can scale down the number of VMs in the cluster.
During the day you would run the number of VMs required. At night you can scale down to the minimum of 5. Check this page on how to scale VM sets: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-scale-up-down/
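The day/night rule could be expressed as a small helper that a scheduled job calls before resizing the scale set. This is only a sketch: the working hours (07:00 to 19:00) and the function name are assumptions, and the actual resize call is deliberately left out.

```python
MIN_NODES = 5  # night-time floor mentioned above


def desired_node_count(hour, daytime_nodes, day_start=7, day_end=19):
    """Return the node count to run at a given hour (0-23).

    day_start and day_end are assumed working hours, not values from the docs.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if day_start <= hour < day_end:
        # Never go below the minimum, even during the day.
        return max(daytime_nodes, MIN_NODES)
    return MIN_NODES


if __name__ == "__main__":
    for hour in (3, 9, 18, 22):
        print(hour, desired_node_count(hour, daytime_nodes=12))
```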
For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to start and stop SF clusters on Azure by starting and stopping the VM scale sets associated with these clusters. But upon restart all your applications (and with them their state) are gone and must be redeployed.
I'm seeing a definite non-round-robin load-balancing pattern in Azure's load balancer for my cloud role. Most of the requests are going to the 1st of the two instances in my Web API worker role setup.
How can I ensure that Azure's LB distributes requests equally?
Note the first screenshot from CloudMonix's dashboard contains CPU Utilization for 1st instance (60-65% sustained average) and 2nd screenshot contains CPU utilization for 2nd instance (2-5% sustained average)
This is consistent across many different times I've looked into this.
Both instances are identical; they only listen for HTTP requests and process them.
There actually is a way of configuring the loadBalancerDistribution for a Cloud Service in the .csdef file. The flaw is documentation updates :-(
Please look at this article: https://azure.microsoft.com/en-us/blog/azure-load-balancer-new-distribution-mode/
The value of LoadBalancerDistribution can be sourceIP for 2-tuple affinity, sourceIPProtocol for 3-tuple affinity, or none for no affinity (i.e. 5-tuple).
I'll look into getting the schema article updated to reflect this.
As for the load distribution - if you have not specifically chosen the 2- or 3-tuple algorithm, you should be running with the 5-tuple.
You can use https://resources.azure.com to look at the current configuration.
I know that CPU is a reflection of load, but the load balancer balances based on network sessions, so please ensure that the CPU load and distribution of network sessions correlate. In your situation I would be surprised if they do not - just a reminder.
Please look at this article to ensure you are not running with keep-alives: Extremely uneven cloud service load-balancing with Azure
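To show why the tuple choice and keep-alives matter, here is a simplified Python model of hash-based distribution. It is not Azure's actual hashing algorithm: the SHA-256 hash, the field names, and `pick_instance` are assumptions for illustration only. The mode names mirror the LoadBalancerDistribution values above.

```python
import hashlib

# Which fields of a flow feed the hash in each distribution mode.
FIELDS_BY_MODE = {
    "none": ("src_ip", "src_port", "dst_ip", "dst_port", "protocol"),  # 5-tuple
    "sourceIPProtocol": ("src_ip", "dst_ip", "protocol"),              # 3-tuple
    "sourceIP": ("src_ip", "dst_ip"),                                  # 2-tuple
}


def pick_instance(flow, instance_count, mode="none"):
    """Map a flow to an instance index by hashing the fields the mode uses."""
    key = "|".join(str(flow[field]) for field in FIELDS_BY_MODE[mode])
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % instance_count


if __name__ == "__main__":
    flow = {"src_ip": "203.0.113.7", "src_port": 50000,
            "dst_ip": "198.51.100.1", "dst_port": 80, "protocol": "tcp"}
    # A kept-alive connection reuses the same 5-tuple, so every request on it
    # lands on the same instance.
    print([pick_instance(flow, 2) for _ in range(5)])
    # New connections from new source ports can land on either instance.
    print([pick_instance(dict(flow, src_port=p), 2) for p in range(50000, 50010)])
```

In this model, a small number of clients holding long-lived keep-alive connections produce exactly the kind of lopsided split described in the question, because the balancer distributes connections, not individual HTTP requests.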
I've definitely had the same question in the past, but have noticed that over a sustained period (a few days or more) that the requests are balanced between the instances. From my personal research you cannot configure the load balancing on azure cloud services. Here is a document describing the service definition file, and I would imagine that if it was configurable, it would be in there.
However, you can configure the load balancer more explicitly using Azure Resource Manager.