We have a Functions App which scales to a couple of hundred instances under peak load. We now need a way to notify all the instances when a particular event happens (for example, a new message in a queue). What are the potential approaches to achieve this? Please advise.
One pattern is to have a common marker (a blob) which changes when the common state is updated. Each instance can proactively check the blob's ETag to determine whether the state has changed, and if so, it knows to reload its state.
Note that:
It's important for the instance to check (rather than wait for a change notification) because there can be a lag in most notification mechanisms. For example, a blob trigger could lag several minutes.
There's no way to bypass the load-balancer and send a message to a specific instance. So you can't proactively send an invalidation message.
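The marker check described above can be sketched as follows. This is a minimal in-memory illustration, not Azure SDK code: MarkerBlob and Instance are hypothetical names, and a random hex string stands in for the blob's ETag. In a real deployment the ETag comparison would presumably be a cheap metadata request against the blob.

```python
import uuid


class MarkerBlob:
    """In-memory stand-in for the shared marker blob (hypothetical, not an SDK type)."""

    def __init__(self, state):
        self.state = state
        self.etag = uuid.uuid4().hex

    def update(self, new_state):
        # Any write changes the ETag, which is what the instances watch for.
        self.state = new_state
        self.etag = uuid.uuid4().hex


class Instance:
    """One function instance holding a cached copy of the shared state."""

    def __init__(self, blob):
        self.blob = blob
        self.cached_state = blob.state
        self.cached_etag = blob.etag
        self.reloads = 0

    def get_state(self):
        # Check the ETag before trusting the cache; reload only when it changed.
        if self.blob.etag != self.cached_etag:
            self.cached_state = self.blob.state
            self.cached_etag = self.blob.etag
            self.reloads += 1
        return self.cached_state


blob = MarkerBlob({"version": 1})
instances = [Instance(blob) for _ in range(3)]
blob.update({"version": 2})
states = [inst.get_state() for inst in instances]
```

Each instance pays only for a comparison on the common path and reloads once per change, however many times it calls get_state afterwards.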
Another pattern is to have your state fully externalized in something like Redis. That's easy to invalidate and update. (Although that's essentially just a special case of the prior suggestion.)
Related
I am creating a poll app. Users can define one or more poll questions and configure answer options. Guests can join a session and when a poll (question) is activated, start voting. Basically what a standard poll looks like.
For processing the incoming votes, I use the Azure Service Bus. I have an endpoint that accepts votes and sends a message to a Service Bus Queue. Then, an Azure Function with a Service Bus Queue trigger will consume that message and persist the vote somewhere in a repository.
My problem is that I want another 'background process', I imagine another Azure Function, that will be triggered when votes come in, to go and calculate the cumulative of votes to be able to draw a pie chart.
Now I want this Function to be triggered as efficiently as possible. Key is that it must be accurate. What I'm looking for, is a method that will trigger the calculation once when a vote comes in, but when a bunch of votes comes in, I want to trigger the calculation only once after the last vote was persisted. I was thinking of introducing a new queue to send 'calculation commands' to. I use a real-time framework to update the pie chart. I would like to send pie-chart updates frequently, but not necessarily thousands of times a second when huge amounts of votes came in in a short amount of time.
I looked for a solution where I can use the de-duplication of an SB queue, but I think this de-dup also checks for previously sent messages. And using this solution does not guarantee that the calculation takes place after the last vote has been processed, because the message may be recognized as a duplicate and therefore ignored.
Another solution may be to introduce a SessionId for the votes queue allowing me to overcome the problem that vote messages are handled simultaneously, but this feels like an anti-pattern using the Service Bus. In the end, you want the thing to scale like a maniac when large amounts of votes come in, so for that reason, the session is a no go to me.
And now I'm running out of ideas. Is there a mechanism I overlooked that I can take advantage of, for example to only put a message on a queue when there is no similar message waiting to be processed (e.g. without a lock)?
You can trigger the Function using one of the available Event Grid events for Service Bus, if the concern is that you don't want a listener to run at all times.
The Azure Functions approach suggested by Clemens is a viable approach. You probably don't need Event Grid because your function could be triggered by the Service Bus queue.
I want to trigger the calculation only once after the last vote was persisted.
If there is a way to indicate that the voting period is over, you could have a 2nd function that runs the calculations from the data stored by processing voting messages. One thing to watch out for is how the 1st function that accepts the voting messages stores the data. If the data is stored in append-only mode, you're good. If you're trying to keep only a counter, you'll have contention, and I don't recommend that approach. Append-only is a more efficient approach.
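As a rough illustration of the append-only idea, the sketch below keeps one record per vote and derives the totals on demand. The names vote_log, persist_vote and calculate_totals are made up for this example, and a Python list stands in for whatever table or blob store actually holds the votes.

```python
from collections import Counter

# Append-only store: a stand-in for a table or blob of individual vote records.
vote_log = []


def persist_vote(poll_id, option):
    # First function: only ever appends, so writers never update a shared counter.
    vote_log.append((poll_id, option))


def calculate_totals(poll_id):
    # Second function: derives the totals from the stored records when triggered.
    return Counter(option for pid, option in vote_log if pid == poll_id)


for option in ["red", "blue", "red", "green", "red"]:
    persist_vote("poll-1", option)
persist_vote("poll-2", "blue")

totals = calculate_totals("poll-1")
```

Because the totals are recomputed from the records, it does not matter how many times or in which order the calculation is triggered; the latest run always reflects every vote persisted so far.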
In a Producer-Consumer case with multiple app instances, I know I am supposed to have some type of queue for the distribution of events to the consumers. But how do I deal with the producer?
I must query a database for objects with an expired deadline every minute. That will push work to a message queue, so distribution is not a problem. My concern is that if I have multiple instances of the app, I have to make sure that only one is producing work.
Am I supposed to solve this by electing a cluster leader? Is there a common algorithm or library in NodeJS for this? My guess is that I will have to reach for some magic Redis command and make my instances aware of each other.
There are always many different ways to achieve things, but my suggestion is to create an idempotent outbox table in your database, where multiple producers throw the records to be published to the message queue.
Then, you can deploy a tool like Debezium that does transaction log tailing (reads the database transaction log) and pushes the message to whatever message queue technology you're using.
Please note that it's also a good practice to implement the idempotency check on your consumers to make sure they don't process the same message twice.
Wix - How We Implemented Idempotency in a Billing System at Scale
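To make the two idempotency points concrete, here is a small in-memory sketch. Outbox and Consumer are hypothetical classes invented for this example: the dictionary key plays the role of a unique key on the outbox table, and the log-tailing step (Debezium in the answer) is reduced to a plain loop.

```python
class Outbox:
    """In-memory stand-in for an outbox table with a unique key column (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def add(self, key, payload):
        # Inserting an existing key is a no-op, like an ignored unique-constraint violation.
        if key in self.rows:
            return False
        self.rows[key] = payload
        return True


class Consumer:
    """Consumer that remembers which message keys it has already handled."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, key, payload):
        # Idempotency check: skip messages that were already processed.
        if key in self.seen:
            return
        self.seen.add(key)
        self.processed.append(payload)


outbox = Outbox()
# Two app instances run the same "expired deadline" query and find the same object.
first = outbox.add("deadline-42", {"object_id": 42})
second = outbox.add("deadline-42", {"object_id": 42})

consumer = Consumer()
for key, payload in outbox.rows.items():
    consumer.handle(key, payload)
    consumer.handle(key, payload)  # simulate a redelivery from the queue
```

With both checks in place, it no longer matters that several producers run the query at once: the duplicate insert is dropped at the outbox, and a redelivered message is dropped at the consumer.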
I want to detect the fact that an instance of my Azure role has crashed. Detection in my case means that another instance of my role is notified about the crash. Please review my idea explained below or propose another solution.
The idea I came up with takes advantage of the fact that items in the Azure Queue have limited processing time.
Configure an Azure Queue. All instances of the role listen to this queue.
Configure role instances to have an internal endpoint.
When instance A starts, it posts a message to the queue. The message contains the ID of instance A, the IP of A's internal endpoint, and a marker indicating that this message should be forwarded back to A.
Most likely the message ends up on another instance B. B will forward the MessageId and PopReceipt to A via the internal endpoint. Instance A creates a CloudQueueMessage object using this constructor: http://msdn.microsoft.com/en-us/library/dn451949.aspx.
Instance A keeps extending the visibility timeout of the received message indefinitely. From the Azure Queue's point of view, this message is being processed for a very long time. In the first update, A removes the "forward-this-message" marker.
If instance A crashes it stops prolonging the processing. The message will become visible automatically for other instances soon.
Instance C picks up the message and learns about crashed A: message contains the ID of instance A and no "forward-this-message" marker.
If instance A stops gracefully it marks its queue message as processed.
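The core of the idea, a message that stays hidden only while its owner keeps renewing it, can be simulated as below. This is a simplified sketch with a hypothetical in-memory Queue class and explicit timestamps; the forwarding of MessageId and PopReceipt between instances is left out, and none of the method names are Azure SDK calls.

```python
class Queue:
    """Tiny in-memory queue with a visibility timeout (a stand-in, not the Azure SDK)."""

    def __init__(self):
        self.messages = []

    def put(self, body):
        self.messages.append({"body": body, "visible_at": 0.0})

    def receive(self, now, timeout):
        # Return the first visible message and hide it for `timeout` seconds.
        for msg in self.messages:
            if msg["visible_at"] <= now:
                msg["visible_at"] = now + timeout
                return msg
        return None

    def extend(self, msg, now, timeout):
        # Heartbeat: keep the message hidden a little longer.
        msg["visible_at"] = now + timeout


TIMEOUT = 30.0  # assumed visibility timeout in seconds
queue = Queue()
queue.put({"instance_id": "A"})

# Instance A holds its own message and renews it while it is alive.
held = queue.receive(now=0.0, timeout=TIMEOUT)
queue.extend(held, now=20.0, timeout=TIMEOUT)                # hidden until t=50
seen_while_alive = queue.receive(now=40.0, timeout=TIMEOUT)  # nothing visible

# A crashes after t=20 and stops renewing; the message resurfaces at t=50.
seen_after_crash = queue.receive(now=55.0, timeout=TIMEOUT)
crashed_instance = seen_after_crash["body"]["instance_id"]
```

The detection delay is bounded by the timeout chosen for each renewal, so a shorter timeout detects crashes faster at the cost of more frequent renewals.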
This all seems very convoluted.
Personally, I would go back and look at the original assumption that I need to know when an instance crashes, and consider what I do with that information. I would favor an optimistic solution (i.e., assume success and handle failure) rather than the pessimistic solution (i.e., assume failure so provide some mechanism to ensure success). One problem with the latter is that you are going to have to handle undeclared instance crashes anyway, so why not make that the default behavior? That is, invoke the operation on the instance and handle any failure that occurs.
For example, if I want to invoke an operation on an internal endpoint on another instance I would load balance against all the other instances and, on detecting a failed instance, try the operation on another instance. Ryan Dunn has what is now an ancient post on, among other things, load balancing against internal endpoints.
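A minimal sketch of that optimistic approach might look like this. The endpoint addresses, the InstanceDown exception and the ping operation are all invented for illustration; a real implementation would catch whatever error its transport raises for an unreachable instance.

```python
class InstanceDown(Exception):
    """Raised by the (hypothetical) transport when an instance does not respond."""


def call_with_failover(endpoints, operation):
    """Try the operation on each endpoint in turn and return the first success."""
    failures = []
    for endpoint in endpoints:
        try:
            return endpoint, operation(endpoint)
        except InstanceDown as exc:
            failures.append((endpoint, exc))  # remember the failure and move on
    raise RuntimeError(f"all instances failed: {failures}")


crashed = {"10.0.0.4"}  # pretend this instance has died


def ping(endpoint):
    if endpoint in crashed:
        raise InstanceDown(endpoint)
    return "ok"


used, result = call_with_failover(["10.0.0.4", "10.0.0.5", "10.0.0.6"], ping)
```

The caller never needs advance notice of the crash: the failure is discovered at the moment of the call and handled by moving to the next instance.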
My basic point is that it is going to be hard to robustly perform this type of orchestration with messages being passed from one instance to another. There are just too many possible failure points. It would be better to come up with a solution that more directly addresses the underlying need. A simple solution is almost always preferable to a more complex solution.
I am looking for a way to have a "Singleton" module over multiple worker role instances.
I would like to have a parallel execution model with Queues and multiple worker roles in Azure.
The idea is that I would like to have a "master" instance that, let's say, checks for new data and schedules it by adding it to a queue, processes all messages from a special queue that nobody else processes, and has blob storage mounted as a virtual drive with read/write access.
I will always have only one "master instance". When that master instance goes down for some reason, another instance from the one already instantiated should very quickly be "elected" for a master instance (couple of seconds). This should happen before the broken instance is replaced by a new one by the Azure environment (about 15 min).
So it will be some kind of self-organizing, dynamic environment.
I was thinking of some locking based on storage or table data, with the ability to set lock timeouts and some kind of "watchdog" timer, to borrow microprocessor terminology.
There is a general approach to what you seek to achieve.
First, your master instance. You could do your check based on instance ID. It is fairly easy. You need RoleEnvironment.CurrentRoleInstance to get the "Current instance", now compare the Id property with what you get out of RoleEnvironment.CurrentRoleInstance.Role.Instances first member ordered by Id. Something like:
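using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

// The instance whose Id sorts first among all instances acts as the master.
var instance = RoleEnvironment.CurrentRoleInstance;
if (instance.Id.Equals(instance.Role.Instances.OrderBy(ins => ins.Id).First().Id))
{
    // you are in the single master
}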
Now you need to elect master upon "Healing"/recycling.
You need to handle the RoleEnvironment's Changed event. Check whether the change is a topology change (you don't need the exact change in topology). If it is, elect the next master based on the above algorithm. Check out this great blog post on exactly how to hook the events and detect changes.
Forgot to add.
If you like locks - blob lease is the best way to acquire / check locks. However working with just the RoleEnvironment events and the simple master election based on Instance ID, I don't think you'll need that complicated locking mechanism. Besides - everything lives in the Queue until it is successfully processed. So if the master dies before it processes something, the "next master" will process it.
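For readers who do want the lock, the behaviour of a lease-based master election can be sketched in memory as below. The Lease class is a hypothetical stand-in and does not call the blob storage API; it only models the two properties that matter here: one holder at a time, and expiry unless the holder renews. The 15-second duration is an assumption for the example.

```python
class Lease:
    """In-memory stand-in for a blob lease: one holder at a time, expires unless renewed."""

    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, instance_id, now):
        # Succeeds if the lease is free, expired, or already held by the caller (renewal).
        if self.holder in (None, instance_id) or now >= self.expires_at:
            self.holder = instance_id
            self.expires_at = now + self.duration
            return True
        return False


lease = Lease(duration=15.0)
a_is_master = lease.try_acquire("A", now=0.0)
b_blocked = not lease.try_acquire("B", now=5.0)
a_renewed = lease.try_acquire("A", now=10.0)   # lease now runs until t=25
# A dies after t=10 and stops renewing; B takes over once the lease has expired.
b_is_master = lease.try_acquire("B", now=26.0)
```

The failover time is roughly the lease duration plus the interval at which standby instances retry, which is how a takeover within seconds rather than minutes becomes possible.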
I have a simple worker role in Azure that does some data processing on a SQL Azure database.
The worker basically adds data from a 3rd-party data source to my database every 2 minutes. When I have two instances of the role, this obviously doubles up unnecessarily. I would like to have 2 instances for redundancy and the 99.95% uptime, but do not want them both processing at the same time as they will just duplicate the same job. Is there a standard pattern for this that I am missing?
I know I could set flags in the database, but am hoping there is another easier or better way to manage this.
Thanks
As Mark suggested, you can use an Azure queue to post a message. You can have the worker role instance post a followup message to the queue as the last thing it does when processing the current message. That should deal with the issue Mark brought up regarding the need for a semaphore. In your queue message, you can embed a timestamp marking when the message can be processed. When creating a new message, just add two minutes to current time.
And... in case it's not obvious: in the event the worker role instance crashes before completing processing and fails to repost a new queue message, that's fine. In this case, the current queue message will simply reappear on the queue and another instance is then free to process it.
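The self-rescheduling message can be simulated as follows. This is a sketch under stated assumptions: Queue and work_once are hypothetical, time is passed in explicitly, the 300-second invisibility timeout is made up, and the stand-in queue itself withholds messages whose embedded timestamp is not yet due (a real worker would read the timestamp and leave early messages alone).

```python
INTERVAL = 120.0    # two minutes between runs, as in the question
VISIBILITY = 300.0  # assumed invisibility timeout for a received message


class Queue:
    """In-memory stand-in for a storage queue (hypothetical, not the Azure SDK)."""

    def __init__(self):
        self.messages = []

    def put(self, run_at):
        # run_at is the embedded timestamp marking when the message may be processed.
        self.messages.append({"run_at": run_at, "visible_at": 0.0})

    def receive(self, now):
        # Hand out a message only if it is visible and its timestamp is due.
        for msg in self.messages:
            if msg["visible_at"] <= now and msg["run_at"] <= now:
                msg["visible_at"] = now + VISIBILITY
                return msg
        return None

    def delete(self, msg):
        self.messages.remove(msg)


def work_once(queue, worker, now, log, crash=False):
    msg = queue.receive(now)
    if msg is None:
        return False
    if crash:
        return False  # worker died mid-run: nothing reposted, nothing deleted
    log.append((worker, now))         # the actual data processing would go here
    queue.put(run_at=now + INTERVAL)  # last step: post the follow-up message
    queue.delete(msg)
    return True


queue, log = Queue(), []
queue.put(run_at=0.0)                           # seed message
work_once(queue, "w1", 0.0, log)                # w1 runs the job
work_once(queue, "w2", 0.0, log)                # nothing is due for w2
work_once(queue, "w2", 120.0, log, crash=True)  # w2 takes the message and crashes
work_once(queue, "w1", 130.0, log)              # message is still hidden
work_once(queue, "w1", 420.0, log)              # message reappeared; w1 recovers
```

In this run only one worker processes each interval, and the crashed run at t=120 is picked up again once the invisibility timeout has elapsed.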
There is not a super easy way to do this, I don't think.
You can use a semaphore, as Mark has mentioned, to basically record the start and the stop of processing. Then you can have any number of instances running, each inspecting the semaphore record and only acting if the semaphore allows it.
However, the caveat here is: what happens if one of the instances crashes in the middle of processing and never releases the semaphore? You can implement a "timeout" value after which other instances will attempt to kick-start processing if there hasn't been an unlock for X amount of time.
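A semaphore record with such a timeout might be sketched like this. SemaphoreRecord is a hypothetical in-memory stand-in for a flag row or small blob, and the 600-second value for "X" is an assumption. In a real shared store the check-and-set inside try_start would need to be atomic (for example a conditional update), which this single-process sketch does not have to worry about.

```python
STALE_AFTER = 600.0  # assumed "X": seconds before an unreleased lock counts as abandoned


class SemaphoreRecord:
    """Stand-in for a flag row in the database or a small blob (hypothetical)."""

    def __init__(self):
        self.owner = None
        self.started_at = None

    def try_start(self, worker, now):
        # Refuse while another run is in progress, unless that run looks abandoned.
        held = self.owner is not None
        stale = held and now - self.started_at >= STALE_AFTER
        if held and not stale:
            return False
        self.owner, self.started_at = worker, now
        return True

    def stop(self, worker):
        # Only the current owner may release the semaphore.
        if self.owner == worker:
            self.owner = self.started_at = None


sem = SemaphoreRecord()
results = [
    sem.try_start("w1", 0.0),  # w1 starts processing
    sem.try_start("w2", 1.0),  # w2 is refused while w1 is running
]
sem.stop("w1")
results.append(sem.try_start("w2", 120.0))  # w2 starts, then crashes without stop()
results.append(sem.try_start("w1", 240.0))  # refused: lock held and not yet stale
results.append(sem.try_start("w1", 720.0))  # lock is stale after 600s; w1 takes over
```

The timeout has to be comfortably longer than a normal processing run, otherwise a slow but healthy instance would be treated as crashed.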
Alternatively, you can use a third-party monitoring service like AzureWatch to watch for unresponsive instances in Azure and start a new instance if the number of "Ready" instances is under 1. This can save you some money by not having to keep 2 instances up and running all the time, but there is a slight lag between when an instance fails and when a new one is started.
A semaphore as suggested would be the way to go, although I'd probably go with a simple timestamp heartbeat in blob store.
The other thought is, how necessary is it? If your loads can sustain being down for a few minutes, maybe just let the role recycle?
Small catch on David's solution. Re-posting the message to the queue would happen as the last step of the current execution, so that if the machine crashes along the way, the current message would expire and resurface on the queue. That assumes the message was originally peeked and requires a de-queue operation to remove it from the queue. The de-queue must happen before inserting the new message into the queue. If the role crashes between these 2 operations, there will be no tokens left in the system and processing will come to a halt.
The ESB dup check sounds like a feasible approach, but it does not sound like it would be deterministic either since the bus can only check for identical messages currently existing in a queue. But if one of the messages comes in right after the previous one was de-queued, there is a chance to end up with 2 processes running in parallel.
An alternative solution, if you can afford it, would be to never de-queue and just lease the message via Peek operations. You would have to ensure that the processing time in your worker role never goes beyond the invisibility timeout. As far as creating the token in the first place, the same worker role startup strategy described before combined with ASB dup check should work (since messages would never move from the queue).