Does a Service Fabric stateless singleton service restart when it fails? - azure

We use Service Fabric to deploy stateless microservices. One of the microservices is designed as a singleton. This means it is designed to be deployed on a single node only:
InstanceCount = 1
Normally, if there is more than 1 instance and one fails, the others keep working.
But how does the single instance behave? I cannot find this scenario in the documentation. I only found out that when the node is updated and the parameter IsSingletonReplicaMoveAllowedDuringUpgrade is set to true, the instance can be moved to another node, but no source explicitly says what happens when the singleton fails during execution.
Does it restart automatically? And if so, then how long is the downtime?

Service Fabric will restart the service for you automatically. How long the restart takes depends on how loaded the machine is, how large the service is, and the type of failure, but it is typically within a couple of seconds.
The amount of time it takes to restart can also depend on how the service failed. Process crashes are quicker to recover from. Machine failures or networking cuts can take longer to detect, but even in these cases SF will usually restart things within 10-30 seconds.

Related

Service Fabric - how to handle exceptions in startup / WebHostBuilder.Build()

I have a Service Fabric app. If the configuration is incorrect, I get an exception when building my web host. In this particular scenario the app is never going to start, and Service Fabric will continually try to start / provision new app nodes. When that happens, it becomes difficult to redeploy the application (sometimes the deployment fails and sometimes it takes forever).
What's the best way to handle this situation? Catch the exception somewhere so the app starts? Or maybe there is a Service Fabric configuration that specifies how many times it will try to provision / start a new node before it gives up?
When a CodePackage crashes, Service Fabric uses a backoff to start it again.
You can modify the restart behavior on the cluster level.
If you change the ActivationRetryBackoffExponentiationBase behavior into 'Exponential' instead of 'Linear', that should give the cluster more quiet time between retries.
The value for ActivationMaxFailureCount specifies how many restart attempts should be made.
More info:
about cluster config
about service restart behavior
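To make the difference between the two retry styles concrete, here is a minimal Python sketch of linear versus exponential backoff. It is not Service Fabric's actual retry formula: the function names, the 5-second interval, the base of 2, and the 300-second cap are all illustrative assumptions chosen for the example.

```python
def linear_delays(interval_s, max_failures, cap_s):
    """Delay before each retry grows by a fixed step: interval * attempt."""
    return [min(interval_s * attempt, cap_s)
            for attempt in range(1, max_failures + 1)]


def exponential_delays(interval_s, base, max_failures, cap_s):
    """Delay before each retry is multiplied by `base` on every failure."""
    return [min(interval_s * base ** attempt, cap_s)
            for attempt in range(1, max_failures + 1)]


if __name__ == "__main__":
    # Hypothetical values: 5 s interval, 5 attempts, 300 s cap.
    print(linear_delays(5, 5, 300))          # [5, 10, 15, 20, 25]
    print(exponential_delays(5, 2, 5, 300))  # [10, 20, 40, 80, 160]
```

With the same number of attempts, the exponential schedule leaves much longer gaps between the later retries, which is the "quiet time" mentioned above. The attempt limit plays the role that ActivationMaxFailureCount plays in the cluster settings.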

Are there scenarios where a Service Fabric Service is torn down but the host process is reused?

I am troubleshooting an issue where a service dependency is created in Program.cs and passed into the service class. (For more context, this is a stateless service, but my question applies to both kinds.) This service's RunAsync method uses the supplied CancellationToken to determine whether the service is still running. If the token gets cancelled, it calls Dispose on the dependency. The symptom I am diagnosing is that on startup the dependency is sometimes not initialized. I am pretty sure I read somewhere in the docs that in some scenarios the host process may be reused rather than torn down when a service instance is torn down, but I can't seem to find it now.
Does the Host process outlive, and rehost new service instances in Service Fabric?
As far as I understand it, the process won't shut down as long as any replica is still hosted in it. If there are no replicas left, the process will be closed after a grace interval.
See these discussions for more information - Processes keep running after service is deleted and Processes still keep running after Service Fabric App is removed.
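If the host process can indeed outlive a service instance, the symptom in the question follows naturally from the wiring. The Python sketch below is only an illustration of that lifetime problem, not Service Fabric code: `Dependency`, `ServiceInstance`, and `run_until_cancelled` are hypothetical stand-ins for the dependency, the service class, and the point where RunAsync sees the cancelled token.

```python
class Dependency:
    """Stand-in for a disposable resource such as a client or connection."""

    def __init__(self):
        self.disposed = False

    def dispose(self):
        self.disposed = True


class ServiceInstance:
    """Stand-in for a service instance; make_dependency runs once per instance."""

    def __init__(self, make_dependency):
        self.dependency = make_dependency()

    def run_until_cancelled(self):
        # Mirrors the pattern in the question: dispose once the token is cancelled.
        self.dependency.dispose()


# Problematic wiring: one object created at process start, shared by every instance.
shared = Dependency()
first = ServiceInstance(lambda: shared)
first.run_until_cancelled()
second = ServiceInstance(lambda: shared)  # same process, new instance
print(second.dependency.disposed)         # True: it starts with a dead dependency

# Safer wiring: each instance builds and owns its own dependency.
third = ServiceInstance(Dependency)
third.run_until_cancelled()
fourth = ServiceInstance(Dependency)
print(fourth.dependency.disposed)         # False
```

The design point is ownership: whatever disposes the dependency should also be what creates it, so passing a factory into the service instead of a ready-made object keeps the two lifetimes aligned.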

Running a Windows Service as a stateful service in Service Fabric

I have three .NET programs currently running as Windows services. We are migrating to Service Fabric and I have a few questions. Our intent is to migrate the services to stateful services, since we need to keep track of file locations, batch size, etc. that are currently stored in an app.config file. So we can "lift and shift" the code from the onTimer event to RunAsync, as discussed in this Stack Overflow question:
How to Migrate Windows Service in Azure Service Fabric
However there are some questions I have about these services. Of course part of using SF is to have the applications in a reliable environment to keep these applications available as much as possible, so the first question is:
Should we only deploy the service to one node and use the reliable collection to maintain the state of the process should the node go down and have to be brought back up?
Or, should we deploy the application to say 3 nodes and just have each application on their node check the reliable collection to see if another application is processing files and to wait?
The application will "awake" at a determined interval and look at a folder. If there are any files in the folder, it will process them. This could take from a couple of seconds to many minutes. So if the application was on three nodes, it is entirely possible that the other two applications on their nodes would wake up to process files. If they could check a reliable dictionary to see whether one of the other instances of the application is already processing files, they would just wait until the next time they are needed.
I know this is vague, I am looking for input on whether to launch the application on multiple nodes or a single node?
In short: stateful services have partitioned data. So you will have at least one, and probably more than one, partition. For each partition a primary replica will be up and running, serving requests or doing work. Then for each primary there will be some secondary replicas that will take over when the primary fails. More info here.
In the configuration of the service you specify the number of partitions and the replica count:
<Service Name="Processing">
  <StatefulService ServiceTypeName="ProcessingType" TargetReplicaSetSize="[Processing_TargetReplicaSetSize]" MinReplicaSetSize="[Processing_MinReplicaSetSize]">
    <UniformInt64Partition PartitionCount="[Processing_PartitionCount]" LowKey="0" HighKey="25" />
  </StatefulService>
</Service>
The primary and secondary instances (replicas) will be distributed over the cluster nodes so that, for example, when the node running the primary goes down, a replica on another node will take over.
There is more to it than what I have described but this is the basic idea behind it all.
So to answer your question: you should specify enough replicas on other nodes to guarantee high availability.

Upgrade Service Fabric Service that Fails to Honor Cancellation Token

I've got a stateful service running in a Service Fabric cluster that I now know fails to honor a cancellation token passed into it. My fault.
I'm ready to release the fix, but during the upgrade process, I'm expecting the service replica on the faulty primary node to get stuck since it won't honor the token passed in.
I can use Restart-ServiceFabricDeployedCodePackage or even Restart-ServiceFabricNode to manually take down the stuck replica, but that will result in a brief service interruption during the upgrade process.
Is there any way to release this fix with zero downtime?
This is not possible for a stateful service using the Service Fabric infrastructure; you will need to accept some downtime on the upgrade. Once you have a version that honors the cancellation token, you will be fine.
That said, depending on the use of the state, and if you have a load balancer between your clients and the service, you can stand up another service instance on the new fixed version and use the load balancer to drain your traffic across to the new version, upgrade the old one, drain back to it, and then drop the second service you created. This will allow for a zero downtime scenario.
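For readers unsure what "honoring the token" looks like in practice, the idea is that the run loop checks for cancellation between small units of work and returns promptly once it is requested. The Python sketch below illustrates that pattern only: `threading.Event` stands in for the CancellationToken, and `run_loop`, `work_log`, and the timings are hypothetical names and values, not part of any Service Fabric API.

```python
import threading
import time


def run_loop(cancel_event, work_log, step_s=0.01):
    """Do small units of work, checking for cancellation between each one."""
    while not cancel_event.is_set():
        work_log.append("tick")
        # Wait on the event instead of sleeping, so a cancel request wakes us at once.
        cancel_event.wait(step_s)
    work_log.append("stopped")


if __name__ == "__main__":
    cancel = threading.Event()
    log = []
    worker = threading.Thread(target=run_loop, args=(cancel, log))
    worker.start()
    time.sleep(0.05)
    cancel.set()  # plays the role of the platform cancelling the token
    worker.join(timeout=1)
    print(worker.is_alive())  # False: the loop exited promptly
```

A loop that blocks for long stretches without checking the event is the analogue of the stuck replica described in the question.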
The only workarounds I can think of are worse since they turn off parts of health checks during upgrades and "force" the process to come down. This doesn't make things more graceful or improve downtime, and has a side effect of potentially causing other health issues to be ignored.
There's always some downtime, even with the fully rolling upgrades, since swapping a primary to another node is never instantaneous and callers need to discover the new location. With those commands, you're just converting a more graceful shutdown and cleanup into a failure, which results in the same primary swap. Shouldn't be a huge difference since clients (and SF) have to deal with failure normally anyway.
I'd keep using those commands since they give you good manual control over which replicas/processes to poke when things get stuck.

Azure Development - How to stop a Web Role instance

I need to test how my code will handle the failure of a web role instance in a development environment.
How do I terminate one of the instances? I can't see any option in the UI for this. Seems like a strange omission.
Update
The issue relates to a distributed cache layer (I know that Azure offers its own).
I want to be able to test how the system reacts to a missing or additional node, etc.
Perhaps my real question is:
How up to date is RoleEnvironment.CurrentRoleInstance.Role.Instances?
The need to simulate ungraceful exits in the dev emulator usually arises because you are doing something in your web role that is stateful or long running. That is generally discouraged, but is sometimes unavoidable.
I suspect the best way to simulate a failure is to kill processes. If you open Task Manager (or better, Process Explorer), you will see "WatDebugger" hosting either "WaIISHost" or "WaWorkerHost". If you kill this process, I think it will simulate a failure.
Honestly, it is easier to test this one in the cloud however. You can RDP into one of the instances and kill the 'WaAppAgent' process. That will kill your RoleEntryPoint and fabric controller agent. That will be a true ungraceful failure.
By failure, do you mean becoming unavailable? It should be seamless because the next request would simply be handled by one of the other instances. As long as there is one instance available Azure will route calls to that instance.
This is the nature of a highly available system: requests are handled by the available instances. This is why you have multiple instances in the first place, to handle requests when one or more instances fail.
This is why you always need to be watchful of how your application handles state. State needs to be maintained outside of the instance, either in queues or in a database. This ensures that any process can pick up a piece of work and execute against it.
There is another question dealing with Session State that should help: How does Microsoft Azure handle Session State?
By terminating an instance, do you mean reducing the instance count and seeing which one gets killed? I like Ryan's view about ungraceful exits, but if it's a forced kill by the fabric it'll be a different ball game.

Resources