Architecture recommendation - Azure WebJob

I have a WebJob that subscribes to an Azure Service Bus topic. The WebJob automates a very important business process. The Service Bus namespace is Premium SKU and has Geo-Recovery configured. My question is about the best practice for setting up high availability for my WebJob (to ensure that the process always runs). I already have the App Service Plan deployed in two regions, and the WebJob is installed in both regions. However, I would like the WebJob in the secondary region to run only if the primary region is down, for example temporarily due to an outage. How can this be implemented? If I run both WebJobs in parallel, that will create serious duplication issues. Is there an architectural pattern I can refer to, or a feature within App Service or Azure that I can use to implement this?

With Service Bus, when you pick up a message in Peek Lock mode, it is locked, so it shouldn't be picked up by another process until the lock expires; once you complete the message, Service Bus removes it. In your case, if you are using Peek Lock, you can rely on it to prevent the same message from being picked up by different instances at the same time. See docs
You can also make use of sessions, which are available in the Premium tier of Service Bus. That way, you can group messages into sessions, and each service instance handles its own sessions unless the other instance is not available.

Since a WebJob is tied to an App Service, this really depends on how you have configured it. You already mentioned that the WebJobs are in two regions, which means you have App Services running in two regions. (Make sure you have multiple instances running in each region, spread across different Availability Zones.)
Now it comes down to how your standby region is configured. Is it active/passive with hot standby, active/passive with cold standby, or active/active? If your secondary region is active, with at least one instance running, then the WebJob there is already processing messages.
I would recommend reading through these patterns to understand the options:
Standby Regions Configuration, Multi Region Config
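To make the active/passive idea concrete, here is a minimal sketch of a gate that a secondary-region worker could consult before it starts receiving. It is not an App Service feature: StandbyGate, record_probe and the threshold of three failed probes are hypothetical names and values chosen for illustration, and how the primary is actually probed is left open.

```python
class StandbyGate:
    """Tracks primary health probes and decides when the secondary may run."""

    def __init__(self, failures_before_takeover=3):
        # Hypothetical threshold: how many failed probes in a row count as "down".
        self.failures_before_takeover = failures_before_takeover
        self.consecutive_failures = 0

    def record_probe(self, primary_healthy):
        """Record the result of one health probe of the primary region."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def secondary_should_process(self):
        """True only after enough consecutive failed probes."""
        return self.consecutive_failures >= self.failures_before_takeover


gate = StandbyGate(failures_before_takeover=3)
decisions = []
for healthy in [True, False, False, True, False, False, False, True]:
    gate.record_probe(healthy)
    decisions.append(gate.secondary_should_process())
```

Requiring several consecutive failures keeps a single missed probe from waking the secondary. A gate like this can still be wrong when only the probe path is broken, so the message handler should tolerate an occasional duplicate.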
Regarding Service Bus: when you process a message with Peek-Lock, the message is locked and not visible to other receivers, so no other instance will pick it up. If your WebJob fails to process it in time, fails outright, or crashes, the lock expires, the message becomes visible again, and any other instance can pick it up. So no two instances hold the same message at the same time.
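The locking behaviour described here can be illustrated with a toy in-memory model. This is not the Azure SDK: PeekLockQueue and its methods are made up for the example, and time is passed in explicitly so the lock expiry is easy to follow.

```python
import itertools


class PeekLockQueue:
    """Toy in-memory queue that imitates peek-lock delivery (not the Azure SDK)."""

    def __init__(self, lock_seconds):
        self.lock_seconds = lock_seconds
        self._messages = {}      # message id -> body
        self._locked_until = {}  # message id -> time at which the lock expires
        self._ids = itertools.count(1)

    def send(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = body
        return msg_id

    def receive(self, now):
        """Lock and return the first message that is not currently locked."""
        for msg_id, body in self._messages.items():
            if self._locked_until.get(msg_id, 0) <= now:
                self._locked_until[msg_id] = now + self.lock_seconds
                return msg_id, body
        return None

    def complete(self, msg_id):
        """Settle the message: it is removed and never delivered again."""
        self._messages.pop(msg_id, None)
        self._locked_until.pop(msg_id, None)


queue = PeekLockQueue(lock_seconds=30)
queue.send("process order 42")

first = queue.receive(now=0)         # primary instance locks the message
second = queue.receive(now=10)       # secondary gets nothing while it is locked
redelivered = queue.receive(now=31)  # lock expired without completion: delivered again
queue.complete(redelivered[0])
after_complete = queue.receive(now=100)
```

The third receive is the case to watch: if the first instance was merely slow rather than dead, the same work can end up being done twice, so the handler should be safe to repeat.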
Better Approach
I would recommend using Azure Functions to process the queue messages. They are a serverless offering with a free invocation credit each month and are naturally highly available.
You can find out more here:
Azure Function Svc Bus Trigger

Related

Long Running Tasks in Service Fabric and Scaling Cluster In

We are using Azure Service Fabric (Stateless Service) which gets messages from the Azure Service Bus Message Queue and processes them. The tasks generally take between 5 mins and 5 hours.
When it's busy we want to scale out servers, and when it gets quiet we want to scale back in again.
How do we scale in without interrupting long running tasks? Is there a way we can tell Service Fabric which server is free to scale in?
Azure Monitor Custom Metric
Integrate your SF service with EventFlow. For instance, make it send logs to Application Insights.
While your task is being processed, send logs that indicate it's in progress.
Configure a custom metric in Azure Monitor so that scale-in happens only in the absence of logs indicating that the machine has in-progress tasks.
The trade-off here is that scale-in has to wait until all in-progress tasks have finished.
There is a good article that explains how to Scale a Service Fabric cluster programmatically
Here is another approach which requires a bit of coding - Automate manual scaling
Develop another service, either as part of the SF application or as a VM extension. The point here is to have the service run on every node in the cluster and track the status of task execution.
There are well-defined steps for manually excluding an SF node from the cluster:
Run Disable-ServiceFabricNode with intent ‘RemoveNode’ to disable the node you’re going to remove (the highest instance in that node type).
Run Get-ServiceFabricNode to make sure that the node has indeed transitioned to disabled. If not, wait until the node is disabled. You cannot hurry this step.
Follow the sample/instructions in the quick start template gallery to change the number of VMs by one in that Nodetype. The instance removed is the highest VM instance.
And so forth. Find more info here: Scale a Service Fabric cluster in or out using auto-scale rules. The takeaway here is that these steps could be automated.
Implement scaling logic in a new service that monitors which nodes have finished their tasks and are idle, and scales them in using the steps described above.
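The decision part of that scaling service could look roughly like the sketch below. It assumes the service already knows how many tasks are running on each node; pick_node_to_remove, the in_progress_tasks mapping and min_nodes are hypothetical names for illustration, not Service Fabric APIs.

```python
def pick_node_to_remove(in_progress_tasks, min_nodes):
    """Return the instance number that is safe to remove, or None.

    in_progress_tasks maps a node's instance number to its count of running tasks.
    Only the highest instance is considered, because that is the VM removed
    when the node type is scaled in by one.
    """
    if len(in_progress_tasks) <= min_nodes:
        return None  # already at the minimum cluster size
    highest = max(in_progress_tasks)
    if in_progress_tasks[highest] > 0:
        return None  # highest node is still busy: wait and check again later
    return highest


idle_highest = pick_node_to_remove({0: 2, 1: 0, 2: 0, 3: 1, 4: 0}, min_nodes=3)
busy_highest = pick_node_to_remove({0: 0, 1: 0, 2: 0, 3: 1}, min_nodes=3)
at_minimum = pick_node_to_remove({0: 0, 1: 0, 2: 0}, min_nodes=3)
```

When a node is returned, the service would go on to disable it with intent RemoveNode, wait for it to report as disabled, and only then reduce the VM count, as in the steps above.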
Hopefully it makes sense.
Thanks a lot to #tank104 for the help on elaborating my answer!

Migration to Azure Service Fabric - Architectural considerations

We have been on Azure since 2010 and have benefited greatly from its performance and reliability in our application. Azure offers a lot of enterprise-level services and I think that the new "Azure Service Fabric" is great.
What I cannot understand by reading the documentation is the approach on migrating an "old" Cloud Service to the new Service Fabric. Why do we want to migrate? For horizontal scaling and more reliability.
Currently we have a single-instance cloud service, that spins up a lot of subservices. Those subservices are great candidates for microservices. The only problem is that some of these subservices are "runners", i.e. they just cycle on our users database and decide whether an operation (service) has to be run for a particular user or not.
How would you migrate a service like this considering that more than one instance may run this service?
Thanks
The first thing to keep in mind is that once a service is started it keeps running, and its lifecycle and uptime are controlled by Service Fabric (for example, it will be restarted automatically if it crashes). The second thing to keep in mind is that you will end up with multiple instances of the service running at the same time on different nodes, so they will end up doing the exact same thing on different nodes of your cluster.
Your first reflex could be to have one stateless service kind/instance per runner "subservice" that keeps running and leverages RunAsync (https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-advanced-usage). Personally, I wouldn't take that approach, because the instances do the exact same thing independently and would need some kind of synchronization between them to prevent useless concurrency.
A better approach would be to have your runner services run only once in a while, when requested by the "main" service acting as an orchestrator. You could use a queue-based approach in which the "main" service submits tasks (messages) to be processed by the runners, which listen concurrently on the same queue, ensuring that at most one service instance completes each task.
For the queue, think Service Bus or Reliable Concurrent Queue (https://learn.microsoft.com/en-us/dotnet/api/microsoft.servicefabric.data.collections.preview.ireliableconcurrentqueue-1).
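As a rough illustration of the queue-based approach, the sketch below uses Python's in-process queue.Queue and threads as stand-ins for the real queue and the runner instances. run_runners and the runner names are hypothetical; with Service Bus or a Reliable Concurrent Queue the dequeue call would differ, but the shape is the same.

```python
import queue
import threading


def run_runners(user_ids, runner_count):
    """The orchestrator enqueues one task per user; runners compete for them."""
    tasks = queue.Queue()
    for user_id in user_ids:
        tasks.put(user_id)

    handled = []  # (runner name, user id) pairs, in completion order
    handled_lock = threading.Lock()

    def runner(name):
        while True:
            try:
                user_id = tasks.get_nowait()  # each task goes to exactly one runner
            except queue.Empty:
                return
            with handled_lock:
                handled.append((name, user_id))

    threads = [
        threading.Thread(target=runner, args=(f"runner-{i}",))
        for i in range(runner_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


handled = run_runners(range(100), runner_count=4)
handled_users = sorted(user_id for _, user_id in handled)
```

Which runner handles which user varies from run to run, but every user is handled exactly once, which is the property the runner subservices need.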

Waiting for a service to be ready (Service Fabric)

I have four services running on Azure Service Fabric, but two of those four services depend on another one. Is there a way to make a service's initialization wait until another service announces it is ready?
No. There's no ordering to service creation (services can be created at any time, not just during a deployment from your build machine), and what does it even mean for your service to be ready? From our perspective it means the Failover Manager found nodes that the service is able to run on and the code packages have been activated on those nodes. The platform doesn't know what your service code does though. From your perspective it probably means "when it's responding to my requests" otherwise it's not "ready," which can happen at any time during the service's lifetime for any number of reasons:
Service was just deployed and its communication stack hasn't opened an endpoint yet
Service instance/replica moved and its communication stack is spinning back up on a new node
Service partition is in quorum loss and not accepting write operations
etc.
This is an ongoing thing that your services need to be prepared to handle. If two of your services can't do any work until they are able to talk to another service, then they need to poll the service they depend on, through an endpoint on that service that you define, until it's available.
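A polling loop of that kind might look like the following sketch. wait_until_ready and is_ready are hypothetical names; is_ready stands for whatever call you make against the readiness endpoint you define on the dependency.

```python
import time


def wait_until_ready(is_ready, timeout_seconds, poll_interval=0.5):
    """Poll is_ready() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Stand-in for a dependency that answers "not ready" twice, then "ready".
answers = iter([False, False, True])
became_ready = wait_until_ready(lambda: next(answers), timeout_seconds=5, poll_interval=0)

# A dependency that never comes up: the caller gets False back and must decide what to do.
never_ready = wait_until_ready(lambda: False, timeout_seconds=0, poll_interval=0)
```

Because the dependency can become unavailable again at any point in its lifetime, the same check is useful around every call, not only at startup.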

Azure Service Bus: High Availability

I'm currently building a hybrid-cloud solution that needs to write messages to a queue for later processing. It is absolutely imperative that the queue is highly available (99.999+% uptime).
My options are to read/write messages to a local ZeroMQ high availability pair, or an Azure Service Bus. I would prefer to go the Azure Service Bus route, but can't find any documentation regarding high availability configuration for Azure Service Bus.
Has anyone had success setting up Azure Service Bus for high availability? I understand that the SLA for a single instance of any Azure service cannot be changed. I'm thinking more along the lines of the failover capabilities of Azure Web Apps.
The main thing you can do to consume a service at a higher availability than its SLA is to make sure you handle retry logic. The key here is the transient nature of most outages, and tuning a retry backoff to handle edge cases. Some use linear or exponential backoffs to wait progressively longer for the service to come back up.
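Here is a minimal sketch of exponential backoff around a send. The names backoff_delays and call_with_retry are hypothetical, and ConnectionError is only an assumed stand-in for whatever transient error your client library raises.

```python
import time


def backoff_delays(retries, base_seconds=1.0, cap_seconds=60.0):
    """Delays that double on each retry, capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * 2 ** n) for n in range(retries)]


def call_with_retry(operation, attempts=5, sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with exponential backoff."""
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(delays[attempt])


# Stand-in for a send that fails twice before the service comes back.
outcomes = iter([ConnectionError("down"), ConnectionError("still down"), "sent"])


def flaky_send():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


slept = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_send, sleep=slept.append)
```

In production you would usually add some random jitter to the delays so that many clients do not retry at the same instant.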
Also, you can have more than one Service Bus namespace, in different regions, for geo-redundancy, and either load-balance messages across the two or use one as a hot backup. This can get you around regional outages and keep your service up when one data center is not meeting its local SLA.
You can find the SLA for Azure Service Bus here: legal/sla/service-bus/v1_0/
For Service Bus Relays, we guarantee that at least 99.9% of the time,
properly configured applications will be able to establish a
connection to a deployed Relay. For Service Bus Queues and Topics, we
guarantee that at least 99.9% of the time, properly configured
applications will be able to send or receive messages or perform other
operations on a deployed Queue or Topic. For Service Bus Basic and
Standard Notification Hub tiers, we guarantee that at least 99.9% of
the time, properly configured applications will be able to send
notifications or perform registration management operations with
respect to a Notification Hub. For Event Hubs Basic and Standard
tiers, we guarantee that at least 99.9% of the time, properly
configured applications will be able to send or receive messages or
perform other operations on the Event Hub.
We've had Service Bus Relay up and running for 5+ years and have had one outage. It was an outage at the specific data center the relay was provisioned in and touched many services. After that we implemented redundancy by implementing a secondary Service Bus Relay namespace in a different data center location. The reconfigured code was set to check the connectivity on every connection and switch the primary and secondary connections. We treated them as equals so once we "failed over" that namespace would become primary.
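The switch-over logic we used was along these lines. This is a simplified sketch rather than our production code: FailoverSender is a made-up name, the two namespaces are represented by plain callables, and ConnectionError stands in for whatever connectivity failure your client reports.

```python
class FailoverSender:
    """Sends through the primary; on failure, swaps primary and secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def send(self, message):
        try:
            return self.primary(message)
        except ConnectionError:
            # Treat both namespaces as equals: the one that works becomes primary.
            self.primary, self.secondary = self.secondary, self.primary
            return self.primary(message)


state = {"east_up": True}
sent = []


def east(message):
    if not state["east_up"]:
        raise ConnectionError("east namespace unreachable")
    sent.append(("east", message))


def west(message):
    sent.append(("west", message))


sender = FailoverSender(primary=east, secondary=west)
sender.send("a")
state["east_up"] = False
sender.send("b")          # fails over to west
state["east_up"] = True
sender.send("c")          # stays on west, which is now the primary
```

Note that there is no automatic fail-back: once west has taken over it stays primary until it fails in turn.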
Service Bus now supports Geo-disaster recovery and Geo-replication at the namespace level.
https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-geo-dr

Windows Azure Inter-Role communication

I want to create an Azure application which does the following:
User is presented with a MVC 4 website (web role) which shows a list of commands.
When the user selects a command, it is broadcast to all worker roles.
Worker roles process the task, store the results and notify web role
Web role displays the combined results of the worker roles
From what I've been reading there seem to be two ways of doing this: the Windows Azure Service Bus or using Queues. Each worker role also stores the results in the database.
The Service Bus seems more appropriate with its publish/subscribe model, so all worker roles would get the same command at roughly the same time. Queues seem easier to use though.
Can the Service Bus be used locally with the emulator when developing? I am using a free trial and cannot keep the application running constantly whilst still developing. Also, when using queues how can you notify the web role that processing is complete?
I agree. ServiceBus is a better choice for this messaging requirement. You could, with some effort, do the same with queues. But, you'll be writing a lot of code to implement things that the ServiceBus already gives you.
There is not a local emulator for ServiceBus like there is for the Azure Storage service (queues/tables/blobs). However, you could still use the ServiceBus for messaging between roles while they are running locally in your development environment.
As for your last question about notifying the web role that processing is complete, there are several ways to go here. Just a few thoughts (not an exhaustive list)...
Table storage where the web role can periodically check the status of the unit of work.
Another ServiceBus Queue/topic for completed work.
Internal endpoints. You'll have to have logic to know if it's just an update from worker role N or if it is indicating a completed unit of work for all worker roles.
I agree with Rick's answer, but would also add the following things to think about:
If you choose the Service Bus Topic approach then as each worker role comes online it would need to generate a subscription to the topic. You'll need to think about subscription maintenance of when one of the workers has a failure and is recycled, or any number of reasons why a subscription may be out there.
Telling the web role that all the workers are complete is interesting. The options Rick provides are good ones, but you'll need to think about some things here. It means that the web role needs to know just how many workers are out there, or have some other mechanism to decide when all have reported done. You could have the situation of five worker roles receiving a message and starting work, then one of them starts to repeatedly fail processing. The other four report their completion but now the web role is waiting on the fifth. How long do you wait for a reply? Can you continue? What if you just told the system to scale down, and while the web role thinks there are five workers there are now only four? These are things you'll need to think about, and they all depend on your requirements.
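One way to make those questions explicit is to have the web role track, per command, which workers it expects and by when. The sketch below is only an illustration: CommandTracker, its method names and the 60-second deadline are all hypothetical, and the right answers depend on your requirements.

```python
class CommandTracker:
    """Tracks which workers have reported a result for one broadcast command."""

    def __init__(self, expected_workers, deadline):
        self.expected = set(expected_workers)
        self.deadline = deadline  # time after which we stop waiting
        self.results = {}

    def report(self, worker_id, result):
        """Record a result; reports from unknown workers are ignored."""
        if worker_id in self.expected:
            self.results[worker_id] = result

    def forget(self, worker_id):
        """Stop waiting for a worker, e.g. after a scale-down removed it."""
        self.expected.discard(worker_id)

    def status(self, now):
        missing = self.expected - set(self.results)
        if not missing:
            return "complete"
        if now >= self.deadline:
            return "timed_out"
        return "waiting"


tracker = CommandTracker(["w1", "w2", "w3", "w4", "w5"], deadline=60)
for worker in ["w1", "w2", "w3", "w4"]:
    tracker.report(worker, "ok")

while_waiting = tracker.status(now=10)   # w5 has not reported yet
after_deadline = tracker.status(now=60)  # give up waiting on w5
tracker.forget("w5")                     # or: w5 was removed by a scale-down
after_forget = tracker.status(now=61)
```

Whether "timed_out" means showing partial results or failing the whole command is a product decision; the tracker only makes the state visible.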
Based on your question, you could use either queue service and get good results. But each is going to have different challenges to overcome as well as advantages.
Some advantages of Service Bus queues are that they provide blocking receipt with a persistent connection (up to 100 connections), they can monitor messages for completion, and they can send larger messages (256KB).
Some advantages of storage queues over the Service Bus solution are that they're slightly faster (if 15 ms matters to you), you can use a single storage system (since you'll probably be using Storage for blob and table services anyway), and auto-scaling is simple. If you need to auto-scale your worker roles based on the load, passing the requests through a storage queue makes auto-scaling trivial: you just set up auto-scaling in the Azure Cloud Service UI under the scale tab.
A more in-depth comparison of the two azure queue services can be found here: http://msdn.microsoft.com/en-us/library/hh767287.aspx
Also, when using queues how can you notify the web role that processing is complete?
For the Azure Storage Queues solution, I've written a library that can help: https://github.com/brentrossen/AzureDistributedService.
It provides a proxy layer that facilitates RPC style communication from web roles to worker roles and back through Storage Queues.

Resources