Long Running Tasks in Service Fabric and Scaling Cluster In - azure

We are using Azure Service Fabric (Stateless Service) which gets messages from the Azure Service Bus Message Queue and processes them. The tasks generally take between 5 mins and 5 hours.
When its busy we want to scale out servers, and when it gets quiet we want to scale back in again.
How do we scale in without interrupting long running tasks? Is there a way we can tell Service Fabric which server is free to scale in?

Azure Monitor Custom Metric
Integrate your SF service with
EventFlow. For instance, make it sending logs into Application Insights
While your task is being processed, send some logs in that will indicate that
it's in progress
Configure custom metric in Azure Monitor to scale in only in case on absence of the logs indicating that machine
has in-progress tasks
The trade-off here is to wait for all the events finished until the scale-in could happen.
There is a good article that explains how to Scale a Service Fabric cluster programmatically
Here is another approach which requires a bit of coding - Automate manual scaling
Develop another service either as part of SF application or as VM extension. The point here is to make the service running on all the nodes in a cluster and track the status of tasks execution.
There are well-defined steps how one could manually exclude SF node from the cluster -
Run Disable-ServiceFabricNode with intent ‘RemoveNode’ to disable the node you’re going to remove (the highest instance in that node type).
Run Get-ServiceFabricNode to make sure that the node has indeed transitioned to disabled. If not, wait until the node is disabled. You cannot hurry this step.
Follow the sample/instructions in the quick start template gallery to change the number of VMs by one in that Nodetype. The instance removed is the highest VM instance.
And so forth... Find more info here Scale a Service Fabric cluster in or out using auto-scale rules. The takeaway here is that these steps could be automated.
Implement scaling logic in a new service to monitor which nodes are finished with their tasks and stay idle to scale them in using instructions described in previous steps.
Hopefully it makes sense.
Thanks a lot to #tank104 for the help on elaborating my answer!

Related

Architecture recommendation - Azure Webjob

I have a webjob that subscribes to an Azure service Bus topic. The webjob automates a very important business process. For the Service bus, it is Premium SKU and have Geo-Recovery configured. My question is about the best practice to setup High Availability for my webjob (to ensure that the process runs always). I already have the App Service Plan deployed in two regions, and the webjob is installed in both the regions. However, I would like my webjob in the secondary region to run only if the primary region is down - maybe temporarily due to an outage. How can this be implemented? If I run both the webjob in parallel, that will create some serious duplication issues. Is there any architectural pattern I can refer to, or use any features within App Service or Azure to implement this?
With ServiceBus, when you can pick up a message, it is locked so shouldn't be picked up by another process unless the lock time expires or you issue a compled message back to service bus. In your case, if you are using Peek Lock, you can use it to prevent the same message being picked up by different instances. See docs
You can also make use of sessions which is available in the premium instance of ServiceBus. In this way, you can group messages to a session and each service instance handles their own session unless the other instance is not available.
Since WebJob is associated with App service , so really depends how you have configured this. You already mentioned that WebJobs are in 2 regions which mean you have app services running in 2 regions. (make sure you have multiple instance running in each region and different Availability zones).
Now it comes down what configuration you have regarding standby region. Is it Active/passive with hot Standby, Active/passive with cold Standby or is it active/Active. If your secondary region is Active where you have atleast one instance running then your webjob is actually processing the message.
I would recommend read through these patterns and understand.
Standby Regions Configuration , Multi Region Config
Regarding Service bus, When you are processing the message with Peek-Lock it means the message is not visible in the queue so no other instance would pick up. If your webjob is not able to process in time or failed to do or crash , the message become visible in the queue again and any other instance can pick it up so no two instances can pick same message.
Better Approach
I would recommend using Azure functions to process queue message .They are serverless offering with free invocations credit a month and are naturally highly available.
You can find more about here
Azure Function Svc Bus Trigger

Is a HostedService automatically scaled-out when AppService plan is autoscaling?

I have a .NET Core application running on Azure AppService.
I have also a hosted service interacting with a table and updating it according to a given condition.
I am experiencing an increase of CPU usage of just one of the instances when my autoscaling rules kick in and start scaling out the AppService.
My question is if HostedService is automatically scaled-out too when new instances are warmed up by autoscaling.
Yes they are. When you scale out, you have multiple instances of the same application running.
I figured this out the hard way. I was running a hosted service that sent out reports through emails once a day. After I scaled out to 2 instances, clients started receiving the same email twice a day.

How to delay Service Fabric runtime automatic upgrades

Our team recently had an incident due to our stateless services being restarted for azure runtime automatic updates. One of the services was in the middle of processing a task when it was forcefully shutdown. These tasks can take as long as 4 hours.
Either through code or configuration, is there a method for letting Azure know that our services are busy and can't be shutdown as this time?
In other words, how can we let Azure know when our services are ready for the service fabric runtime upgrade?
Well first of all, why don't you switch to manual upgrade mode?
Second, in the case of long running jobs you still have to take in account that nodes can fail, service instances can be moved or change role. All these kind of events will terminate your long running job if you don't handle shutdown notifications well.
The service is signaled that it will be shutdown etc. by Service Fabric by using the CancellationToken that is passed to RunAsync. The following is taken from the docs:
Service Fabric changes the Primary of a stateful service for a variety of reasons. The most common are cluster rebalancing and application upgrade. During these operations (as well as during normal service shutdown, like you'd see if the service was deleted), it is important that the service respect the CancellationToken.
Services that do not handle cancellation cleanly can experience several issues. These operations are slow because Service Fabric waits for the services to stop gracefully.
And this says the same but a bit shorter about the RunAsync method:
Make sure cancellationToken passed to RunAsync(CancellationToken) is honored and once it has been signaled, RunAsync(CancellationToken) exits gracefully as soon as possible.
In your case you should act on the CancellationToken being canceled. You should store the state of your current job somehow so you can resume it the next time RunAsync is called.
If it is really a long running job that cannot be interrupted and resumed by any means you should consider having this work done outside a Reliable Service, like a Web Job or something else. Or accept that some work might be lost.
In other words, you cannot tell Service Fabric to wait shutting down your service. It would mess up balancing and reliability of the cluster as well.
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-capacity#the-durability-characteristics-of-the-cluster
Durability tier privilege allows Service Fabric to pause any VM level infrastructure request (such as a VM reboot, VM reimage, or VM migration)
Bronze - No privileges. This is the default.
Silver - The infrastructure jobs can be paused for a duration of 10 minutes per UD.
Gold - The infrastructure jobs can be paused for a duration of 2 hours per UD. Gold durability can be enabled only on full node VM skus like D15_V2, G5 etc.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.management.servicefabric.models.nodetypedescription.durabilitylevel?view=azure-dotnet

Migration to Azure Service Fabric - Architectural considerations

We are on Azure since 2010 and had a great benefit from a performance and reliability in our application. Azure offers a lot of enterprise-level services and I think that the new "Azure Service Fabric" is great.
What I cannot understand by reading the documentation is the approach on migrating an "old" Cloud Service to the new Service Fabric. Why do we want to migrate? For horizontal scaling and more reliability.
Currently we have a single-instance cloud service, that spins up a lot of subservices. Those subservices are great candidates for microservices. The only problem is that some of these subservices are "runners", i.e. they just cycle on our users database and decide whether an operation (service) has to be run for a particular user or not.
How would you migrate a service like this considering that more than one instance may run this service?
Thanks
First thing to keep in mind is that once a service is started it keeps running, and his lifecycle and uptime is controlled by Service Fabric (ex: it will restart it automatically if it crashes). Second thing to keep in mind is that you will end-up with multiple instances of the service running at the same time (on different nodes), so they will end-up doing the exact same thing on different nodes of your cluster.
Your first reflex could be to have one stateless service kind/instance per runner "subservice" that keeps running and leverage the RunAsync (https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-advanced-usage). Personally, I wouldn't take that approach, since this could then require some kind of synchronization between services to prevent useless concurrency, since they do the exact same thing independently.
A better approach would be to have your runner services need to run only once in a while when requested by the "main" service acting as an orchestrator, you could have a Queue based approach where the "main" service submit tasks (messages) to be processed by the runners, who are listening concurrently on the same Queue, making sure that maximum one service instance would complete the task.
For the Queue, think Service Bus or Reliable Concurrent Queue (https://learn.microsoft.com/enus/dotnet/api/microsoft.servicefabric.data.collections.preview.ireliableconcurrentqueue-1).

Turning off ServiceFabric clusters overnight

We are working on an application that processes excel files and spits off output. Availability is not a big requirement.
Can we turn the VM sets off during night and turn them on again in the morning? Will this kind of setup work with service fabric? If so, is there a way to schedule it?
Thank you all for replying. I've got a chance to talk to a Microsoft Azure rep and documented the conversation in here for community sake.
Response for initial question
A Service Fabric cluster must maintain a minimum number of Primary node types in order for the system services to maintain a quorum and ensure health of the cluster. You can see more about the reliability level and instance count at https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-capacity/. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. Frequently it is possible to bring the nodes back up and Service Fabric will automatically recover from this quorum loss, however this is not guaranteed and the cluster may never be able to recover.
However, if you do not need to save state in your cluster then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than a half hour, and this can be automated by using Powershell to deploy an ARM template. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-creation-via-arm/ shows how to setup the ARM template and deploy using Powershell. You can additionally use a fixed domain name or static IP address so that clients don’t have to be reconfigured to connect to the cluster. If you have need to maintain other resources such as the storage account then you could also configure the ARM template to only delete the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
Q)Is there a better way to stop/start the VMs rather than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
Q) Can we do a primary set with cheapest VMs we can find and add a secondary set with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a ‘Worker’ that is a larger size – and set placement constraints on your application to only deploy to those larger size VMs. However, if your Service Fabric service is storing state then you will still run into a similar problem that once you lose quorum (below 3 replicas/nodes) of your worker VM then there is no guarantee that your SF service itself will come back with all of the state maintained. In this case your cluster itself would still be fine since the Primary nodes are running, but your service’s state may be in an unknown replication state.
I think you have a few options:
Instead of storing state within Service Fabric’s reliable collections, instead store your state externally into something like Azure Storage or SQL Azure. You can optionally use something like Redis cache or Service Fabric’s reliable collections in order to maintain a faster read-cache, just make sure all writes are persisted to an external store. This way you can freely delete and recreate your cluster at any time you want.
Use the Service Fabric backup/restore in order to maintain your state, and delete the entire resource group or cluster overnight and then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup.
Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch which offers native capabilities to quickly burst up compute capacity.
No. You would have to delete the cluster and recreate the cluster and deploy the application in the morning.
Turning off the cluster is, as Todd said, not an option. However you can scale down the number of VM's in the cluster.
During the day you would run the number of VM's required. At night you can scale down to the minimum of 5. Check this page on how to scale VM sets: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-scale-up-down/
For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to start and stop SF clusters on Azure by starting and stopping the VM scale sets associated with these clusters. But upon restart all your applications (and with them their state) are gone and must be redeployed.

Resources