I have a Service Fabric app; if the configuration is incorrect, I get an exception while building my web host. In this particular scenario the app is never going to start, and Service Fabric will continually try to start / provision new app nodes. When that happens, it becomes difficult to re-deploy the application (sometimes it fails to deploy and sometimes it takes forever).
What's the best way to handle this situation? Catch the exception somewhere so the app starts anyway? Or is there a Service Fabric setting that specifies how many times it will try to provision / start a new node before giving up?
When a CodePackage crashes, Service Fabric uses a backoff before starting it again.
You can modify the restart behavior at the cluster level.
If you change the ActivationRetryBackoffExponentiationBase behavior from 'Linear' to 'Exponential', the cluster gets more quiet time between retries.
The value of ActivationMaxFailureCount specifies how many restart attempts should be made.
More info:
about cluster config
about service restart behavior
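For illustration, in a standalone or ARM-based cluster configuration these parameters live under the Hosting section of fabricSettings. A minimal sketch (the values shown are placeholders, not recommendations; the exponential-backoff setting mentioned above goes in the same section):

{
  "fabricSettings": [
    {
      "name": "Hosting",
      "parameters": [
        { "name": "ActivationMaxFailureCount", "value": "10" },
        { "name": "ActivationRetryBackoffInterval", "value": "10" }
      ]
    }
  ]
}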
I've got a .NET worker service that runs on a cron schedule, packaged in a Docker container and deployed to Azure Container Apps. The schedule is managed within the application itself.
Scaling is set to a minimum of 1 replica running at all times.
However, we've found that for some reason the application starts up, idles for ~20-30 seconds waiting for the schedule trigger, stops for 2 seconds, starts and idles for ~20-30 seconds again, and then doesn't run again for ~5-6 minutes. During the idle time, the job might start if the cron schedule happens to line up while the process is running.
Is there any way to diagnose why it might be auto-killing the application?
I can't seem to find any logs showing fatal exceptions or anything along those lines, and running in other environments (locally, Azure Container Instances, etc.) doesn't reproduce the behavior. My suspicion is that it's the auto-scaling behavior: Azure notices that the process is idle for 20-30 seconds at a time and kills that replica, only for it to spin up again 5 minutes later. However, I can't seem to find anything to prove that theory.
I'm aware that other resource types might be better suited (Container Instances, App Service, Functions) though for now I'm stuck with Container Apps.
Found the cause of the issue based on this SO question:
Azure Container Apps Restarts every 30 seconds
It turns out Azure was trying to run health checks against the container even though no HTTP ports were exposed. Thinking the container was unhealthy, Azure killed and restarted it. Turning off HTTP ingress (and therefore the health checks) solved the issue.
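If you're using the Azure CLI, turning ingress off looks roughly like this (app and resource group names are placeholders, and the exact command may vary between CLI versions):

az containerapp ingress disable --name my-cron-worker --resource-group my-rg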
I am troubleshooting an issue where a service dependency is created in Program.cs and passed into the service class (for more context, this is a stateless service, but my question applies to both). The service's RunAsync method uses the supplied CancellationToken to determine whether the service is still running; if the token gets cancelled, it calls Dispose on the dependency. The symptom I am diagnosing is that on start-up the dependency is sometimes not initialized. I am pretty sure I read somewhere in the docs that in some scenarios the host process may be reused rather than torn down when a service instance is torn down, but I can't seem to find it now.
Does the host process outlive service instances and rehost new ones in Service Fabric?
As far as I understand it, as long as any replica is still hosted in the process, the process won't shut down. If there are no replicas left, the process is closed after a grace interval.
See these discussions for more information - Processes keep running after service is deleted and Processes still keep running after Service Fabric App is removed.
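If you want to avoid depending on the process lifetime altogether, one option is to create (and dispose) the dependency inside RunAsync instead of in Program.cs, so each new service instance hosted in a reused process gets a fresh one. A rough sketch, where ExpensiveDependency and DoWorkAsync are hypothetical stand-ins for your real dependency:

using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class Worker : StatelessService
{
    public Worker(StatelessServiceContext context)
        : base(context)
    {
    }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        // Created per RunAsync call, so a reused host process never hands a new
        // service instance an already-disposed dependency.
        using (var dependency = new ExpensiveDependency())
        {
            while (true)
            {
                cancellationToken.ThrowIfCancellationRequested();
                await dependency.DoWorkAsync(cancellationToken); // hypothetical work method
                await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
            }
        }
    }
}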
We use Service Fabric to deploy stateless microservices. One of the microservices is designed as a singleton. This means it is designed to be deployed on a single node only:
InstanceCount = 1
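In the application manifest this looks roughly like the following (service and type names here are placeholders):

<DefaultServices>
  <Service Name="MySingletonService">
    <StatelessService ServiceTypeName="MySingletonServiceType" InstanceCount="1">
      <SingletonPartition />
    </StatelessService>
  </Service>
</DefaultServices>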
Normally, if there is more than 1 instance and one fails, the others keep working.
But how does the single instance behave? I cannot find this scenario in the documentation. I only found out that when the node is updated and the parameter IsSingletonReplicaMoveAllowedDuringUpgrade is set to true then it can be moved to other node, but no source explicitly says what happens when the singleton fails during execution.
Does it restart automatically? And if so, then how long is the downtime?
Service Fabric will restart the service for you automatically. The time it takes to restart can depend on how loaded the machine is, how large the service is, and the type of failure, but it is typically within a couple of seconds.
The amount of time it takes to restart can also depend on how the service failed. Process crashes are quicker to recover from. Machine failures or networking cuts can take longer to detect, but even in these cases SF will usually restart things within 10-30 seconds.
I'm facing the following problem with MassTransit 3. I'm publishing messages from a WebApi to a backend (run as a continuous WebJob). When the backend job is started, everything works well and messages are picked up properly. After about 20 minutes, all messages published from the WebApi stop being picked up by the backend. The messages are published to Azure Service Bus properly but are only picked up after the WebJob process is restarted.
The MT debug log is completely silent and shows no issues, so this question is mainly for the authors of MT, in case they can think of anything that could cause this.
Update 1
The WebJob is continuous and running in Standard mode, so the 20-minute timeout mentioned in the Azure documentation shouldn't apply.
I've checked the logs and the job is running. The environment doesn't log anything about stopping the job, and Process Explorer shows the process, albeit with quite a high thread count (I have just 3 consumers). All threads are in a wait state.
You should be creating a cloud service and not a web job. Web jobs are not meant for continuous processes. A worker role is exactly what you need.
From the Azure documentation:
Web apps in Free mode can time out after 20 minutes if there are no requests to the scm (deployment) site and the web app's portal is not open in Azure. Requests to the actual site will not reset this.
Resolved. The MT process got stuck after spawning around 2k threads. The issue must have been in the Azure transport, as the same configuration worked well with RabbitMQ.
After updating to a newer MT version (.11 beta), the transport started to behave properly.
I noticed that for some reason one of the web roles stopped and restarted itself. Could someone help me understand in what scenarios a web role restarts itself?
And is there any way to find out why the web role restarted itself?
That happens once in a while when Azure performs guest OS upgrades: it stops instances (honoring upgrade domains) and then starts them again shortly thereafter. This is the most frequent scenario; the same can happen if the server hosting the VM is diagnosed as faulty, but that is quite rare.
You should be ready for such restarts - they are normal - and your code should be designed to continue working after such a restart.
Here's a post with more details on the upgrade process.
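As for being ready: the usual pattern is to finish or checkpoint in-flight work when Azure signals the stop, by overriding OnStop in the role entry point. A minimal sketch (DoWork is a hypothetical placeholder for the role's actual processing):

using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    private readonly CancellationTokenSource cancellation = new CancellationTokenSource();
    private readonly ManualResetEvent runCompleted = new ManualResetEvent(false);

    public override void Run()
    {
        try
        {
            while (!cancellation.IsCancellationRequested)
            {
                DoWork(); // hypothetical placeholder for the role's actual processing
            }
        }
        finally
        {
            runCompleted.Set();
        }
    }

    public override void OnStop()
    {
        // Azure calls OnStop before taking the instance down (e.g. for a guest OS
        // upgrade) and allows a few minutes for a graceful shutdown.
        cancellation.Cancel();
        runCompleted.WaitOne();
        base.OnStop();
    }

    private static void DoWork()
    {
        Thread.Sleep(1000);
    }
}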