Waiting for a service to be ready (Service Fabric) - azure

I have four services running on Azure Service Fabric, but two of those four services depend on another one. Is there a way to make a service's initialization wait until another service announces it is ready?

No. There's no ordering to service creation (services can be created at any time, not just during a deployment from your build machine), and what does it even mean for your service to be ready? From our perspective it means the Failover Manager found nodes that the service is able to run on and the code packages have been activated on those nodes. The platform doesn't know what your service code does, though. From your perspective it probably means "when it's responding to my requests," because otherwise it's not "ready," and that can stop being true at any time during the service's lifetime for any number of reasons:
Service was just deployed and its communication stack hasn't opened an endpoint yet
Service instance/replica moved and its communication stack is spinning back up on a new node
Service partition is in quorum loss and not accepting write operations
etc.
This is an ongoing thing that your services need to be prepared to handle. If two of your services can't do any work until they can talk to another service, then they need to poll for the service they depend on until it's available through an endpoint on that service that you define.
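As a rough illustration, here is a minimal polling sketch in C#. The "/health" endpoint and the pre-resolved address are assumptions for the example; in a real cluster you would first resolve the dependency's address through the naming service or the reverse proxy, and you'd re-enter this wait whenever calls start failing later.

```csharp
// Minimal sketch: poll a dependency's (assumed) health endpoint before doing work.
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class DependencyGate
{
    public static async Task WaitForDependencyAsync(Uri healthEndpoint, CancellationToken cancellationToken)
    {
        using var client = new HttpClient();
        while (!cancellationToken.IsCancellationRequested)
        {
            try
            {
                var response = await client.GetAsync(healthEndpoint, cancellationToken);
                if (response.IsSuccessStatusCode)
                {
                    return; // dependency answered, safe to proceed
                }
            }
            catch (HttpRequestException)
            {
                // dependency not reachable yet; fall through to the delay and retry
            }
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
        cancellationToken.ThrowIfCancellationRequested();
    }
}
```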

Related

Architecture recommendation - Azure Webjob

I have a webjob that subscribes to an Azure Service Bus topic. The webjob automates a very important business process. The Service Bus is the Premium SKU and has Geo-Recovery configured. My question is about the best practice to set up high availability for my webjob (to ensure that the process always runs). I already have the App Service Plan deployed in two regions, and the webjob is installed in both regions. However, I would like my webjob in the secondary region to run only if the primary region is down, maybe temporarily due to an outage. How can this be implemented? If I run both webjobs in parallel, that will create some serious duplication issues. Is there any architectural pattern I can refer to, or any feature within App Service or Azure I can use to implement this?
With Service Bus, when you pick up a message it is locked, so it shouldn't be picked up by another process unless the lock expires or you complete the message back to Service Bus. In your case, if you are using Peek-Lock, you can use it to prevent the same message being picked up by different instances. See the docs.
You can also make use of sessions, which are available in the Premium tier of Service Bus. That way you can group messages into a session, and each service instance handles its own session unless the other instance is unavailable.
Since a WebJob is associated with an App Service, it really depends on how you have configured this. You already mentioned that the WebJobs are in two regions, which means you have App Services running in two regions (make sure you have multiple instances running in each region, across different availability zones).
Now it comes down to what configuration you have for the standby region. Is it active/passive with hot standby, active/passive with cold standby, or active/active? If your secondary region is active, with at least one instance running, then your webjob there is actually processing messages.
I would recommend reading through and understanding these patterns:
Standby Regions Configuration, Multi Region Config
Regarding Service Bus: when you are processing a message with Peek-Lock, the message is not visible in the queue, so no other instance will pick it up. If your webjob can't process it in time, fails, or crashes, the message becomes visible in the queue again and any other instance can pick it up, so no two instances process the same message.
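For illustration, here is a minimal sketch of that Peek-Lock flow using the Azure.Messaging.ServiceBus client. The topic name ("orders"), subscription name ("webjob-sub"), and connection string are placeholders, and error handling is reduced to the essentials.

```csharp
// Sketch: receive in peek-lock mode, complete on success, abandon on failure
// so another instance can retry.
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

public static class PeekLockWorker
{
    public static async Task RunAsync(string connectionString)
    {
        await using var client = new ServiceBusClient(connectionString);
        await using var receiver = client.CreateReceiver(
            "orders", "webjob-sub",
            new ServiceBusReceiverOptions { ReceiveMode = ServiceBusReceiveMode.PeekLock });

        while (true)
        {
            var message = await receiver.ReceiveMessageAsync(TimeSpan.FromSeconds(30));
            if (message == null) continue; // nothing to do right now

            try
            {
                // ... run the business process for this message ...
                await receiver.CompleteMessageAsync(message);  // removes it from the subscription
            }
            catch (Exception)
            {
                await receiver.AbandonMessageAsync(message);   // releases the lock so another instance can retry
            }
        }
    }
}
```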
Better Approach
I would recommend using Azure Functions to process the queue messages. They are a serverless offering with a free monthly grant of invocations and are naturally highly available.
You can find out more here:
Azure Function Service Bus Trigger
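As a sketch of what that looks like with the in-process WebJobs SDK model, something along these lines; the topic, subscription, and connection setting names are placeholders for illustration.

```csharp
// Sketch: Azure Function with a Service Bus topic trigger. The runtime receives
// in peek-lock mode, completes the message when the method returns, and abandons
// it for retry when the method throws.
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ProcessBusinessMessage
{
    [FunctionName("ProcessBusinessMessage")]
    public static void Run(
        [ServiceBusTrigger("orders", "function-sub", Connection = "ServiceBusConnection")] string message,
        ILogger log)
    {
        // ... run the business process for this message ...
        log.LogInformation("Processing message: {Message}", message);
    }
}
```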

How to maintain state in a Service Fabric microservice deployed in multiple clusters accessing external resource

I am trying to make my Service Fabric service, which makes a SOAP call to an external service, work when deployed across 2 or more clusters, so that if the service in one cluster has made the connection to the external service, the service in the other cluster doesn't try to make the connection, and vice versa.
I can't think of a better way to design this without storing the state in a database, which introduces a host of issues such as locking and race conditions. What are some designs that fit this scenario? Any suggestions would be highly appreciated.
There is no way to do that out of the box on Service Fabric.
You have to find an approach to orchestrate these calls between clusters/services. You could:
Create a service in one of the clusters that delegates the calls to the other services and stores the info about connections in a single place.
Put a message in a queue and have each service take one message to open a connection (this can be combined with the approach above).
Store every active call in a shared cache (Redis). Before you attempt the call, check whether the connection is already active somewhere; when the connection closes, remove it from the cache so other services can open the connection. Also enable expiration so the entry is cleared if a service fails (a minimal sketch follows after this list).
Store the state in a database as you suggested
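Here is a minimal sketch of the shared-cache option using StackExchange.Redis: claim the connection with a key only one caller can create, and let the key expire so a crashed service doesn't hold the claim forever. The key name and the timeout value are assumptions for the example.

```csharp
// Sketch: coordinate "who owns the external SOAP connection" via Redis.
using System;
using StackExchange.Redis;

public class ExternalCallCoordinator
{
    private readonly IDatabase _db;

    public ExternalCallCoordinator(IConnectionMultiplexer redis) => _db = redis.GetDatabase();

    // Returns true if this instance won the right to open the SOAP connection.
    // SET ... NX with an expiry: only one caller can create the key, and it
    // disappears on its own if the owner crashes.
    public bool TryClaimConnection(string instanceId) =>
        _db.StringSet("external-soap:active-connection", instanceId,
                      expiry: TimeSpan.FromMinutes(5), when: When.NotExists);

    // Call when the connection is closed so another cluster can take over.
    public void ReleaseConnection() =>
        _db.KeyDelete("external-soap:active-connection");
}
```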

Deleting an idle Stateful Service in Service Fabric

I have a set of user-specific stateful services servicing requests forwarded from a public-facing stateless service (web API) in an app.
I'm trying to delete a stateful service if it has not serviced any user request since a given time interval, say an hour. Currently, I'm managing this by keeping a .NET timer in the service itself and using the tick event to self-destruct the service if it's been idle.
Is this the right way to do it? Or is there any other more efficient approach to do this in Azure service fabric?
The mechanism you have will work great and is what we'd normally recommend.
Another way to do it would be to have a general "service manager" service that periodically checks whether services are busy (or is informed by them), and which can kick off the DeleteServiceAsync call. That way only that service needs cluster admin rights, while all the others can be locked down to read-only.
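A hedged sketch of that manager service, using FabricClient to remove an idle service. How idleness is detected is up to you; the isIdleAsync delegate here is a hypothetical stand-in for that check.

```csharp
// Sketch: a single service with cluster admin rights deletes per-user services
// that report themselves idle.
using System;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

public class IdleServiceReaper
{
    private readonly FabricClient _fabricClient = new FabricClient();

    public async Task DeleteIfIdleAsync(Uri serviceName, Func<Uri, Task<bool>> isIdleAsync)
    {
        if (await isIdleAsync(serviceName))
        {
            // Removes the service (and its state) from the cluster.
            await _fabricClient.ServiceManager.DeleteServiceAsync(
                new DeleteServiceDescription(serviceName));
        }
    }
}
```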

Migration to Azure Service Fabric - Architectural considerations

We have been on Azure since 2010 and have seen great benefits in performance and reliability in our application. Azure offers a lot of enterprise-level services, and I think that the new Azure Service Fabric is great.
What I cannot understand by reading the documentation is the approach on migrating an "old" Cloud Service to the new Service Fabric. Why do we want to migrate? For horizontal scaling and more reliability.
Currently we have a single-instance cloud service that spins up a lot of subservices. Those subservices are great candidates for microservices. The only problem is that some of these subservices are "runners", i.e. they just cycle over our user database and decide whether an operation (service) has to be run for a particular user or not.
How would you migrate a service like this considering that more than one instance may run this service?
Thanks
The first thing to keep in mind is that once a service is started it keeps running, and its lifecycle and uptime are controlled by Service Fabric (e.g. it will restart it automatically if it crashes). The second thing to keep in mind is that you will end up with multiple instances of the service running at the same time (on different nodes), so they will end up doing the exact same thing on different nodes of your cluster.
Your first reflex could be to have one stateless service type/instance per runner "subservice" that keeps running and leverages RunAsync (https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-advanced-usage). Personally, I wouldn't take that approach, since it could require some kind of synchronization between services to prevent useless concurrency, given that they do the exact same thing independently.
A better approach, given that your runner services only need to run once in a while when requested by the "main" service acting as an orchestrator, would be a queue-based approach where the "main" service submits tasks (messages) to be processed by the runners, which listen concurrently on the same queue, making sure that at most one service instance completes each task.
For the queue, think Service Bus or Reliable Concurrent Queue (https://learn.microsoft.com/en-us/dotnet/api/microsoft.servicefabric.data.collections.preview.ireliableconcurrentqueue-1).
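To make the shape of that concrete, here is a sketch of a runner as a stateless service pulling from a Service Bus queue in RunAsync. The queue name, connection string, and the competing-consumers setup are assumptions for the example; a Reliable Concurrent Queue would be wired up differently.

```csharp
// Sketch: every runner instance competes on the same queue, so each task
// enqueued by the "main" orchestrator is processed by at most one instance.
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class RunnerService : StatelessService
{
    public RunnerService(StatelessServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        await using var client = new ServiceBusClient("<connection-string>");
        await using var receiver = client.CreateReceiver("runner-tasks");

        while (!cancellationToken.IsCancellationRequested)
        {
            var message = await receiver.ReceiveMessageAsync(TimeSpan.FromSeconds(30), cancellationToken);
            if (message == null) continue;

            // ... perform the operation for the user referenced by the message ...
            await receiver.CompleteMessageAsync(message, cancellationToken);
        }
    }
}
```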

Azure - RoleEnvironement events not firing when role is rebooted or taken down for patching

In short, is there a RoleEnvironment event that I can handle in code when any other role in my deployment is rebooted or taken offline for patching?
I've got an application in production that has web roles for a web front end and web roles running WCF services as an application layer (business logic, data access, etc.). The web layer communicates with the WCF layer over an internal endpoint, as we don't want to expose the services at this point in time. This means it is not possible to use the load balancer to call my service layer through a single URL.
So I have to load balance requests to the WCF web roles manually. This has caused problems in the past when a machine has been recycled by the fabric controller for patching.
I'm handling the RoleEnvironment.Changing and RoleEnvironment.Changed events to adjust the list of backend web roles I am communicating with, which works well in testing when I make a configuration change to increase or decrease the number of instances in my deployment. But if I reboot a role through the portal, this does not fire the RoleEnvironment events.
Thanks,
Rob
RoleEnvironment.Changing is fired "before a change to the service configuration" (my emphasis). In this case no configuration change is occurring; your service is still configured to have exactly the same number of instances. AFAIK there is no way to know when your deployment is taken offline, and clearly there are cases where notice cannot be given in advance (e.g. hardware failure). Therefore you have to code for communication failure, intercept the error, and try another role instance.
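A minimal sketch of that "intercept the error, try another instance" pattern. The callInstance delegate is a hypothetical stand-in for your actual WCF call, and endpoints is whatever backend list you are already maintaining.

```csharp
// Sketch: walk the known internal endpoints and fail over to the next one
// when a call faults (e.g. because that instance is being rebooted).
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class InternalLoadBalancer
{
    public static async Task<TResult> CallAnyAsync<TResult>(
        IReadOnlyList<Uri> endpoints,
        Func<Uri, Task<TResult>> callInstance)
    {
        Exception last = null;
        foreach (var endpoint in endpoints)
        {
            try
            {
                return await callInstance(endpoint);
            }
            catch (Exception ex) // e.g. CommunicationException / TimeoutException for WCF
            {
                last = ex; // this instance may be down; try the next one
            }
        }
        throw new InvalidOperationException("No backend instance responded.", last);
    }
}
```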
I do not believe you can intercept RoleEnvironment changes from a different Role easily.
I would suggest trapping RoleEnvironment changes in the role where they occur, handling them by writing a message/record to some persisted storage, and letting your web roles check that storage either on a regular schedule or every time they communicate with the WCF roles.
Basically, if you're doing your own internal load balancing, you need a mechanism for registration/tear-down of your instances so that you can manage your WCF workers.
You can use Azure Storage queues to post a message when a role is going down and have a worker role that listens on that queue and adjusts things accordingly.
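For example, something along these lines in the WCF worker role, assuming the classic ServiceRuntime and WindowsAzure.Storage libraries; the queue name and configuration setting name are placeholders.

```csharp
// Sketch: when this instance is stopping, drop a message on a storage queue so
// the web roles can remove it from their backend list.
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public class WcfWorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        RoleEnvironment.Stopping += (sender, args) =>
        {
            var account = CloudStorageAccount.Parse(
                RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));
            var queue = account.CreateCloudQueueClient().GetQueueReference("instance-teardown");
            queue.CreateIfNotExists();
            // Web roles poll this queue and drop the instance from their routing table.
            queue.AddMessage(new CloudQueueMessage(RoleEnvironment.CurrentRoleInstance.Id));
        };
        return base.OnStart();
    }
}
```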
