Migration to Azure Service Fabric - Architectural considerations - azure

We have been on Azure since 2010 and have benefited greatly from its performance and reliability in our application. Azure offers a lot of enterprise-level services and I think that the new "Azure Service Fabric" is great.
What I cannot work out from the documentation is how to approach migrating an "old" Cloud Service to the new Service Fabric. Why do we want to migrate? For horizontal scaling and more reliability.
Currently we have a single-instance cloud service that spins up a lot of subservices. Those subservices are great candidates for microservices. The only problem is that some of these subservices are "runners", i.e. they just loop over our users database and decide whether an operation (service) has to be run for a particular user or not.
How would you migrate a service like this considering that more than one instance may run this service?
Thanks

The first thing to keep in mind is that once a service is started it keeps running, and its lifecycle and uptime are controlled by Service Fabric (e.g. it will be restarted automatically if it crashes). The second thing to keep in mind is that you will end up with multiple instances of the service running at the same time (on different nodes), so they will all be doing the exact same thing on different nodes of your cluster.
Your first reflex could be to have one stateless service kind/instance per runner "subservice" that keeps running and leverage the RunAsync (https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-advanced-usage). Personally, I wouldn't take that approach, since this could then require some kind of synchronization between services to prevent useless concurrency, since they do the exact same thing independently.
A better approach would be to have your runner services run only once in a while, when requested by the "main" service acting as an orchestrator. You could use a queue-based approach where the "main" service submits tasks (messages) to be processed by the runners, which listen concurrently on the same queue, so that at most one service instance completes each task.
For the Queue, think Service Bus or Reliable Concurrent Queue (https://learn.microsoft.com/en-us/dotnet/api/microsoft.servicefabric.data.collections.preview.ireliableconcurrentqueue-1).
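The queue-based approach can be illustrated with a minimal sketch. It assumes an in-process `queue.Queue` and threads as stand-ins for Service Bus (or a Reliable Concurrent Queue) and for the runner instances; the function name `run_competing_consumers` and the task names are hypothetical. It only shows the idea that several runners listening on one queue each take different tasks, so no task is handled twice.

```python
import queue
import threading


def run_competing_consumers(tasks, worker_count=3):
    """Let several runners pull from one shared queue; each task is taken once."""
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    processed = []            # (worker_id, task) pairs, in completion order
    lock = threading.Lock()   # protects the shared result list

    def runner(worker_id):
        while True:
            try:
                # get_nowait() removes the task atomically, so two runners
                # can never receive the same item.
                task = task_queue.get_nowait()
            except queue.Empty:
                return
            with lock:
                processed.append((worker_id, task))

    workers = [threading.Thread(target=runner, args=(i,)) for i in range(worker_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return processed


# The "main" service submits one task per user; the runners share the work.
results = run_competing_consumers([f"user-{n}" for n in range(10)])
```

A real broker adds concerns this sketch leaves out, such as message locks, redelivery after a crash, and poison-message handling, so the task handlers should ideally be safe to run more than once.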

Related

Azure App Service and infrastructure maintenance

As I understand it, there is no concept of an update domain in App Service (or in other PaaS offerings). I am wondering how Azure handles OS updates if I have only a single instance of an App Service app. Do I need to plan for two or more instances if I want to avoid the app going down during OS or other updates, or is this handled without downtime? According to the docs, App Service has a 99.95% SLA - is this time reserved here?
First of all, welcome to the community.
Your application will not become unavailable when App Service is patching the OS, so you don't have to worry about that. If that were the case, it would be a huge problem. Instead, the PaaS service makes sure your application is replicated to an updated worker node before that happens.
But you should have multiple instances, as a best practice listed in this article:
To avoid a single point-of-failure, run your app with at least 2-3 instances.
Running more than one instance ensures that your application is available when App Service moves or upgrades the underlying VM instances
Have a look at this detailed blog post:
https://azure.github.io/AppService/2018/01/18/Demystifying-the-magic-behind-App-Service-OS-updates.html
When the update reaches a specific region, we update available instances without apps on them, then move the apps to the updated instances, then update the offloaded instances.
The SLA is the same regardless of the number of instances, even if you select "1 instance":
We guarantee that Apps running in a customer subscription will be available 99.95% of the time
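For a sense of scale, the 99.95% figure can be turned into a downtime budget. This is plain arithmetic, assuming a 30-day month as the measurement window (the window length and the function name `allowed_downtime_minutes` are assumptions for illustration, not taken from the SLA text):

```python
def allowed_downtime_minutes(sla_basis_points, days=30):
    """Downtime budget in minutes for an SLA given in basis points (99.95% -> 9995)."""
    total_minutes = days * 24 * 60
    # Integer basis points avoid the rounding noise of computing 1 - 0.9995.
    return total_minutes * (10000 - sla_basis_points) / 10000


monthly_budget = allowed_downtime_minutes(9995)  # 43200 min * 0.05% = 21.6 minutes
```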
Have a look at Hyper-V and VMware; they will give you a rough idea of how App Service handles that.
If you're looking for zero-downtime deployments with App Services, what you are looking for are deployment slots.
Managing versions can be confusing. Take a look at this issue I opened: it gives a detailed how-to for managing different slot versions, which is not clearly described in the Microsoft docs.

Application per service in Service Fabric

I’m designing my service fabric cluster. I’m between creating one app and hosting all the services inside vs creating 1 app per service.
I didn't find clear guidelines on this. The main advantage I see for 1 app per service is that we can deploy each service independently since it has its own app. We can also host the code in different repos. Are there any downsides to this?
A better approach is to have one Application per set of services where the services provide a cohesive function. An Application should be an umbrella for n number of services which are related in their function, for instance they may be within the same bounded context or be related to a common operational unit. However, this doesn't mean they have to be deployed / updated in unison.
Services can be deployed independently within an ApplicationType if you move away from using the DefaultServices construct. You can read about why Default Services should be avoided in Production here - essentially they create a rigid deployment strategy and you lose some of the power of Service Fabric parameterization available via PowerShell.
The concept of an Application may seem at odds with a microservice architecture, but remember it's just a logical grouping: single services within an Application are still independently deployable.
Lots of useful info in the Application Model docs.
The main advantage I see for 1 app per service is that we can deploy each service independently since it has its own app.
You can also deploy/upgrade individual services in the same application without affecting the other deployed services. Please read about differential packaging here and here.
We can also host the code in different repos
Generally, when we split our code into separate repositories, it is because we have a domain boundary that we don't want to track with other services, for example services owned by different teams or deployed on a different schedule. In that case it would make sense to have them as separate applications.
Are there downsides for this?
Technically, no. But there are some points you have to keep in mind.
When we talk about microservices, we see them as independent services running on their own with as few dependencies as possible on other services. Applications can seem to go against this 'law', because it looks as if we have to deploy the services together. We shouldn't see applications that way, because an application is just a logical isolation for these services. So where is the benefit of SF applications?
When you have multiple services deployed (dependent on each other or not), you need a way to keep track of them as a bigger unit, otherwise you might end up with:
A cluster full of services that are no longer required and are only there because we 'might be using them', or because someone forgot to remove them when their peers became obsolete.
Dependent services missing on new deployments
Version of services not compatible with each other (contracts, APIs, and so on)
An SF Application works like a snapshot of these services. For example, whenever a service gets updated, you also upgrade the application to reference the new definition of your services and their dependencies. This tells SF "this is how I want my services running", and SF will get them into exactly the state you described. This does not mean you have to update all of them when an upgrade is required. SF will do so if you need it to, but you can update just the ones that changed and then deploy a new version of your application, and SF will manage the version of each service for you. As an analogy, it is like a Docker Compose file where you specify the containers you want to deploy as a single deployment.
Given that, when you opt out of the application concept, you lose these benefits, because now you have to manage every single service on its own and keep track of the versions it depends on. In cases where two services in different applications need to be deployed together (because of breaking changes, for example), you would not be able to easily roll back if one of them fails, because they are no longer tied to each other, so you would have to write your own logic to handle this.
A typical scenario you might find yourself in is one where a new version of a service is deployed and other services that were not updated in the same release stop working, yet from the point of view of your deployment the new service looks OK, without any error.
So, in the end, it is just a trade-off: you opt for more flexibility in deploying your services, but end up with more maintenance.

Long Running Tasks in Service Fabric and Scaling Cluster In

We are using Azure Service Fabric (Stateless Service) which gets messages from the Azure Service Bus Message Queue and processes them. The tasks generally take between 5 mins and 5 hours.
When it's busy we want to scale out servers, and when it gets quiet we want to scale back in again.
How do we scale in without interrupting long running tasks? Is there a way we can tell Service Fabric which server is free to scale in?
Azure Monitor Custom Metric
Integrate your SF service with EventFlow. For instance, make it send logs to Application Insights.
While your task is being processed, send logs that indicate it is in progress.
Configure a custom metric in Azure Monitor to scale in only in the absence of logs indicating that the machine has in-progress tasks.
The trade-off here is that scale-in has to wait until all in-progress tasks have finished.
There is a good article that explains how to Scale a Service Fabric cluster programmatically
Here is another approach which requires a bit of coding - Automate manual scaling
Develop another service, either as part of the SF application or as a VM extension. The point here is to have the service running on all the nodes in the cluster so that it can track the status of task execution.
There are well-defined steps for manually removing an SF node from the cluster:
Run Disable-ServiceFabricNode with intent ‘RemoveNode’ to disable the node you’re going to remove (the highest instance in that node type).
Run Get-ServiceFabricNode to make sure that the node has indeed transitioned to disabled. If not, wait until the node is disabled. You cannot hurry this step.
Follow the sample/instructions in the quick start template gallery to change the number of VMs by one in that Nodetype. The instance removed is the highest VM instance.
And so forth. You can find more info in Scale a Service Fabric cluster in or out using auto-scale rules. The takeaway here is that these steps can be automated.
Implement the scaling logic in the new service: monitor which nodes have finished their tasks and are idle, then scale them in using the steps described above.
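The decision part of that scaling service could look roughly like the sketch below. It relies on the point from the steps above that the instance removed is the highest VM instance, so only that node needs to be idle. Everything here is hypothetical: the `NodeStatus` type, the `pick_scale_in_candidate` function, and the `min_nodes` floor are illustrative names and values, and how the task counts are collected (logs, a tracking service, etc.) is left out.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    name: str          # node name as reported by the cluster
    instance: int      # VM instance number within the node type
    active_tasks: int  # long-running tasks currently in progress on this node


def pick_scale_in_candidate(nodes, min_nodes=3):
    """Return the node that is safe to remove now, or None if we should wait."""
    if len(nodes) <= min_nodes:
        return None  # never shrink below the configured floor (assumed value)
    # Scale-in removes the highest instance, so that is the only node to check.
    highest = max(nodes, key=lambda node: node.instance)
    return highest if highest.active_tasks == 0 else None


cluster = [
    NodeStatus("node_0", 0, active_tasks=2),
    NodeStatus("node_1", 1, active_tasks=0),
    NodeStatus("node_2", 2, active_tasks=0),
    NodeStatus("node_3", 3, active_tasks=0),
]
candidate = pick_scale_in_candidate(cluster)  # node_3 is idle and highest
```

Once a candidate is returned, the service would disable that node and reduce the VM count by one, as in the manual steps; if `None` is returned it simply checks again later.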
Hopefully it makes sense.
Thanks a lot to #tank104 for the help on elaborating my answer!

Waiting for a service to be ready (Service Fabric)

I have four services running on Azure Service Fabric, but two of those 4 services depend on another one, is there a way to make a service initialization wait until another service announces it is ready?
No. There's no ordering to service creation (services can be created at any time, not just during a deployment from your build machine), and what does it even mean for your service to be ready? From our perspective it means the Failover Manager found nodes that the service is able to run on and the code packages have been activated on those nodes. The platform doesn't know what your service code does though. From your perspective it probably means "when it's responding to my requests" otherwise it's not "ready," which can happen at any time during the service's lifetime for any number of reasons:
Service was just deployed and its communication stack hasn't opened an endpoint yet
Service instance/replica moved and its communication stack is spinning back up on a new node
Service partition is in quorum loss and not accepting write operations
etc.
This is an ongoing thing that your services need to be prepared to handle. If two of your services can't do any work until they are able to talk to another service, then they need to poll the service they depend on, through an endpoint on that service that you define, until it's available.
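A polling loop of that kind might look like the following minimal sketch. The `wait_until_ready` helper and its parameters are hypothetical, and `is_ready` stands for whatever check you define against the dependency's endpoint (for example an HTTP health probe); the `sleep` and `clock` arguments exist only so the example can run without real delays.

```python
import time


def wait_until_ready(is_ready, timeout_s=30.0, initial_delay_s=0.1,
                     max_delay_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() with exponential backoff until it succeeds or time runs out."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while clock() < deadline:
        if is_ready():
            return True
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off, but cap the wait
    return False


# Demo: a fake probe that only succeeds on the third call.
attempts = {"count": 0}


def fake_probe():
    attempts["count"] += 1
    return attempts["count"] >= 3


ready = wait_until_ready(fake_probe, sleep=lambda _seconds: None)
```

Because the dependency can become unavailable again at any point, the same retry logic is usually needed around normal calls as well, not only at startup.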

Windows Azure Inter-Role communication

I want to create an Azure application which does the following:
User is presented with an MVC 4 website (web role) which shows a list of commands.
When the user selects a command, it is broadcast to all worker roles.
Worker roles process the task, store the results and notify web role
Web role displays the combined results of the worker roles
From what I've been reading there seem to be two ways of doing this: the Windows Azure Service Bus or using Queues. Each worker role also stores the results in the database.
The Service Bus seems more appropriate with its publish/subscribe model, so all worker roles would get the same command at roughly the same time. Queues seem easier to use though.
Can the service bus be used locally with the emulator when developing? I am using a free trial and cannot keep the application running constantly while still developing. Also, when using queues how can you notify the web role that processing is complete?
I agree. ServiceBus is a better choice for this messaging requirement. You could, with some effort, do the same with queues. But, you'll be writing a lot of code to implement things that the ServiceBus already gives you.
There is no local emulator for ServiceBus like there is for the Azure Storage service (queues/tables/blobs). However, you could still use the ServiceBus for messaging between roles while they are running locally in your development environment.
As for your last question about notifying the web role that processing is complete, there are several ways to go here. Just a few thoughts (not an exhaustive list)...
Table storage where the web role can periodically check the status of the unit of work.
Another ServiceBus Queue/topic for completed work.
Internal endpoints. You'll have to have logic to know if it's just an update from worker role N or if it is indicating a completed unit of work for all worker roles.
I agree with Rick's answer, but would also add the following things to think about:
If you choose the Service Bus Topic approach, then as each worker role comes online it would need to create a subscription to the topic. You'll need to think about subscription maintenance for when one of the workers fails and is recycled, or for any of the other reasons why a subscription may be left out there.
Telling the web role that all the workers are complete is interesting. The options Rick provides are good ones, but you'll need to think about some things here. It means that the web role needs to know just how many workers are out there, or needs some other mechanism to decide when all have reported done. You could have the situation of five worker roles receiving a message and starting work, then one of them starts to repeatedly fail processing. The other four report their completion but now the web role is waiting on the fifth. How long do you wait for a reply? Can you continue? What if you just told the system to scale down and, while the web role thinks there are 5 workers, there are now only 4? These are things you'll need to think about and they all depend on your requirements.
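One way to make those questions concrete is a small tracker on the web role side that knows which workers it expects and gives up after a deadline. This is only a sketch under assumptions: the `CompletionTracker` class, the worker names, and the 60-second timeout are hypothetical, and how reports arrive (table storage, a topic, internal endpoints) is left open.

```python
class CompletionTracker:
    """Tracks which expected workers have reported a result for one command."""

    def __init__(self, expected_workers, timeout_s):
        self.expected = set(expected_workers)
        self.timeout_s = timeout_s
        self.results = {}

    def report(self, worker_id, result):
        # Ignore reports from workers we did not expect (e.g. scaled in since).
        if worker_id in self.expected:
            self.results[worker_id] = result

    def status(self, elapsed_s):
        """Return (state, missing) where state is complete, waiting or timed_out."""
        missing = self.expected - set(self.results)
        if not missing:
            return "complete", missing
        if elapsed_s >= self.timeout_s:
            return "timed_out", missing
        return "waiting", missing


# Five workers received the command, but only four ever report back.
tracker = CompletionTracker([f"worker-{n}" for n in range(5)], timeout_s=60)
for n in range(4):
    tracker.report(f"worker-{n}", result="ok")

early_state, early_missing = tracker.status(elapsed_s=10)
late_state, late_missing = tracker.status(elapsed_s=60)
```

Whether a timed-out command is shown with partial results, retried, or failed outright is the requirements decision described above; the tracker only makes the missing workers visible.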
Based on your question, you could use either queue service and get good results. But each of them is going to have different challenges to overcome as well as advantages.
Some advantages of Service Bus queues are that they provide blocking receipt with a persistent connection (up to 100 connections), they can monitor messages for completion, and they can send larger messages (256KB).
Some advantages of storage queues over the Service Bus solution are that they are slightly faster (if 15 ms matters to you), you can use a single storage system (since you'll probably be using Storage for blob and table services anyway), and auto-scaling is simple. If you need to auto-scale your worker roles based on the load, passing the requests through a storage queue makes auto-scaling trivial: you just set up auto-scaling in the Azure Cloud Service UI under the scale tab.
A more in-depth comparison of the two azure queue services can be found here: http://msdn.microsoft.com/en-us/library/hh767287.aspx
Also, when using queues how can you notify the web role that processing is complete?
For the Azure Storage Queues solution, I've written a library that can help: https://github.com/brentrossen/AzureDistributedService.
It provides a proxy layer that facilitates RPC style communication from web roles to worker roles and back through Storage Queues.
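The general idea of RPC-style communication over queues can be sketched with a request queue, a response queue, and a correlation ID that ties each reply to its request. This is not the API of the library above; it is a generic illustration using in-process `queue.Queue` objects in place of Storage Queues, and all function names are hypothetical.

```python
import queue
import uuid

request_queue = queue.Queue()   # web role -> worker roles
response_queue = queue.Queue()  # worker roles -> web role


def send_request(command):
    """Web role side: enqueue a command and return its correlation ID."""
    correlation_id = str(uuid.uuid4())
    request_queue.put({"id": correlation_id, "command": command})
    return correlation_id


def worker_step():
    """Worker side: handle one request and reply with the same correlation ID."""
    message = request_queue.get_nowait()
    result = message["command"].upper()  # placeholder for the real work
    response_queue.put({"id": message["id"], "result": result})


def collect_responses(pending_ids):
    """Web role side: drain the response queue and match replies to requests."""
    matched = {}
    while not response_queue.empty():
        reply = response_queue.get_nowait()
        if reply["id"] in pending_ids:
            matched[reply["id"]] = reply["result"]
    return matched


first = send_request("rebuild-index")
second = send_request("send-report")
worker_step()
worker_step()
replies = collect_responses({first, second})
```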
