service fabric unhealthy service affect other services - azure

I'm trying to understand service fabric logic to consider a node in a cluster as unhealthy.
I recently deployed a new version of our application that had 3 unhealthy worker services running on all nodes, they are very light services loading messages from a queue, but because their frequent failures, all other services running on same node were affected by some reason, so all services are reported as unhealthy.
I assume this behavior is a service fabric health monitoring thinking the node is not healthy because multiple services are failing on same node. Is this right?
What is the measures that SF uses to consider a node as unhealthy.

Service Fabric's health model is described in detail here. The measures are always "health reports". Service Fabric emits some health reports on its own, but the model is also extensible and you can add your own.
Regardless of whether you've added any new health reports or are relying only on what is present in the system by default, then you can see what health reports are being emitted for a given node by either selecting the node specifically within SFX or by running a command like the following:
Get-ServiceFabricNodeHealth -NodeName Node1
As we saw in the doc, Node health is mainly determined by
Health Reports against that particular node (ex: Node went down)
Failures of a Deployed Application
Failures of a particular Deployed Service Package (usually the code packages within in)
In these cases SF tries to grab as much information about what failed (exit codes, exceptions and their stack traces, etc) and reports a health warning or error for that node.

Related

Azure App Service and infrastructure maintenance

As I understand there is no concept of update domain in App Services (and in other PaaS offerings). I am wondering how Azure is handling OS updates if I have only a single instance of an App Service app. Do I need to plan for two and more instances if I want to avoid such cases when an app goes down during the OS/other updates or this is handled without downtime? According to docs App Service has 99.95% SLA - is this time reserved here?
First of all, welcome to the community.
Your application will not become unavailable when App Services is patching the OS, you don't have to worry about that. Imagine if that would be the case, it would be a huge problem. Instead, the PaaS service will make sure your application is replicated to an updated worker node before that happens.
But you should have multiple instances, as a best practice listed in this article:
To avoid a single point-of-failure, run your app with at least 2-3 instances.
Running more than one instance ensures that your application is available when App Service moves or upgrades the underlying VM instances
Have a look at this detailed blog post:
https://azure.github.io/AppService/2018/01/18/Demystifying-the-magic-behind-App-Service-OS-updates.html
When the update reaches a specific region, we update available instances without apps on them, then move the apps to the updated instances, then update the offloaded instances.
The SLA is the same regardless the number of instances, even if you select "1 instance":
We guarantee that Apps running in a customer subscription will be available 99.95% of the time
Have a look at Hyper-V and VMWare, it will give you a rough idea on how App Services handle that.
If you're looking for zero-downtime deployments with App Services, what you are looking for are deployment slots.
Managing versions can be confusing, take a look at this issue I opened, it gives you a detailed how-to approach about managing different slot versions, which is not clearly described by Microsoft docs.

Azure service fabric - Performance telemetry of individual service

Is there a way to get the performance telemetry of an individual service running in a node which has other services running in the same node within a servive fabric cluster?
We are using .net core where there are not performance counters either and we arent using containers at the moment. We want to make sure one microservice doesnt hog all the system resources and choke the other microservices running in the same node. We are using guest executables.
We use application insight. It has support for microservices for service fabric ie correlation id where you can trace a request through multiple services within service fabric
Here is set up instruction and example
https://github.com/Microsoft/ApplicationInsights-ServiceFabric

Long Running Tasks in Service Fabric and Scaling Cluster In

We are using Azure Service Fabric (Stateless Service) which gets messages from the Azure Service Bus Message Queue and processes them. The tasks generally take between 5 mins and 5 hours.
When its busy we want to scale out servers, and when it gets quiet we want to scale back in again.
How do we scale in without interrupting long running tasks? Is there a way we can tell Service Fabric which server is free to scale in?
Azure Monitor Custom Metric
Integrate your SF service with
EventFlow. For instance, make it sending logs into Application Insights
While your task is being processed, send some logs in that will indicate that
it's in progress
Configure custom metric in Azure Monitor to scale in only in case on absence of the logs indicating that machine
has in-progress tasks
The trade-off here is to wait for all the events finished until the scale-in could happen.
There is a good article that explains how to Scale a Service Fabric cluster programmatically
Here is another approach which requires a bit of coding - Automate manual scaling
Develop another service either as part of SF application or as VM extension. The point here is to make the service running on all the nodes in a cluster and track the status of tasks execution.
There are well-defined steps how one could manually exclude SF node from the cluster -
Run Disable-ServiceFabricNode with intent ‘RemoveNode’ to disable the node you’re going to remove (the highest instance in that node type).
Run Get-ServiceFabricNode to make sure that the node has indeed transitioned to disabled. If not, wait until the node is disabled. You cannot hurry this step.
Follow the sample/instructions in the quick start template gallery to change the number of VMs by one in that Nodetype. The instance removed is the highest VM instance.
And so forth... Find more info here Scale a Service Fabric cluster in or out using auto-scale rules. The takeaway here is that these steps could be automated.
Implement scaling logic in a new service to monitor which nodes are finished with their tasks and stay idle to scale them in using instructions described in previous steps.
Hopefully it makes sense.
Thanks a lot to #tank104 for the help on elaborating my answer!

Migration to Azure Service Fabric - Architectural considerations

We are on Azure since 2010 and had a great benefit from a performance and reliability in our application. Azure offers a lot of enterprise-level services and I think that the new "Azure Service Fabric" is great.
What I cannot understand by reading the documentation is the approach on migrating an "old" Cloud Service to the new Service Fabric. Why do we want to migrate? For horizontal scaling and more reliability.
Currently we have a single-instance cloud service, that spins up a lot of subservices. Those subservices are great candidates for microservices. The only problem is that some of these subservices are "runners", i.e. they just cycle on our users database and decide whether an operation (service) has to be run for a particular user or not.
How would you migrate a service like this considering that more than one instance may run this service?
Thanks
First thing to keep in mind is that once a service is started it keeps running, and his lifecycle and uptime is controlled by Service Fabric (ex: it will restart it automatically if it crashes). Second thing to keep in mind is that you will end-up with multiple instances of the service running at the same time (on different nodes), so they will end-up doing the exact same thing on different nodes of your cluster.
Your first reflex could be to have one stateless service kind/instance per runner "subservice" that keeps running and leverage the RunAsync (https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-advanced-usage). Personally, I wouldn't take that approach, since this could then require some kind of synchronization between services to prevent useless concurrency, since they do the exact same thing independently.
A better approach would be to have your runner services need to run only once in a while when requested by the "main" service acting as an orchestrator, you could have a Queue based approach where the "main" service submit tasks (messages) to be processed by the runners, who are listening concurrently on the same Queue, making sure that maximum one service instance would complete the task.
For the Queue, think Service Bus or Reliable Concurrent Queue (https://learn.microsoft.com/enus/dotnet/api/microsoft.servicefabric.data.collections.preview.ireliableconcurrentqueue-1).

Waiting for a service to be ready (Service Fabric)

I have four services running on Azure Service Fabric, but two of those 4 services depend on another one, is there a way to make a service initialization wait until another service announces it is ready?
No. There's no ordering to service creation (services can be created at any time, not just during a deployment from your build machine), and what does it even mean for your service to be ready? From our perspective it means the Failover Manager found nodes that the service is able to run on and the code packages have been activated on those nodes. The platform doesn't know what your service code does though. From your perspective it probably means "when it's responding to my requests" otherwise it's not "ready," which can happen at any time during the service's lifetime for any number of reasons:
Service was just deployed and its communication stack hasn't opened an endpoint yet
Service instance/replica moved and its communication stack is spinning back up on a new node
Service partition is in quorum loss and not accepting write operations
etc.
This is an ongoing thing that your services need to be prepared to handle. If two of services can't do any work until they are able to talk to another service, then they need to poll for that service they depend on until it's available through an endpoint on that service that you define.

Resources