Detecting crashes of Azure instances

I want to detect when an instance of my Azure role has crashed. Detection in my case means that another instance of my role is notified about the crash. Please review my idea explained below or propose another solution.
The idea I came up with takes advantage of the fact that a message taken from an Azure Queue stays invisible to other consumers only for a limited time (the visibility timeout).
Configure an Azure Queue. All instances of the role listen to this queue.
Configure the role instances to have an internal endpoint.
When instance A starts, it posts a message to the queue. The message contains the ID of instance A, the IP of A's internal endpoint, and a marker saying that this message should be forwarded back to A.
Most likely the message ends up on another instance B. B forwards the MessageId and PopReceipt to A via the internal endpoint. Instance A creates a CloudQueueMessage object using this constructor: http://msdn.microsoft.com/en-us/library/dn451949.aspx.
Instance A then keeps extending the visibility timeout of the received message indefinitely. From the Azure Queue's point of view, this message is being processed for a very long time. In the first update, A removes the "forward-this-message" marker.
If instance A crashes, it stops extending the timeout. The message soon becomes visible to the other instances automatically.
Instance C picks up the message and learns that A has crashed: the message contains the ID of instance A and no "forward-this-message" marker.
If instance A stops gracefully, it deletes its queue message (marks it as processed).
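To make the timing of this idea easier to reason about, here is a minimal in-memory sketch in Python. It is not Azure SDK code: `FakeQueue`, `simulate` and all the numbers are made up for illustration, and the forwarding step through instance B is skipped (A is assumed to already hold its own message).

```python
class FakeQueue:
    """In-memory stand-in for a queue with per-message visibility timeouts."""

    def __init__(self):
        self._visible_at = {}  # message body -> time at which it becomes visible

    def put(self, body, now):
        self._visible_at[body] = now

    def get(self, now, timeout):
        """Return one visible message and hide it for `timeout` seconds."""
        for body, visible_at in self._visible_at.items():
            if visible_at <= now:
                self._visible_at[body] = now + timeout
                return body
        return None

    def renew(self, body, now, timeout):
        """Heartbeat: push the message's visibility further into the future."""
        self._visible_at[body] = now + timeout

    def delete(self, body):
        """Graceful shutdown: the message is removed and never resurfaces."""
        self._visible_at.pop(body, None)


def simulate(crash_at, horizon=100, timeout=10, heartbeat=5):
    """Return the time at which another instance sees A's message, or None."""
    queue = FakeQueue()
    queue.put("instance-A", now=0)
    queue.get(now=0, timeout=timeout)  # A holds its own message from t=0
    for now in range(1, horizon):
        if now < crash_at and now % heartbeat == 0:
            queue.renew("instance-A", now, timeout)  # A is alive and renews
        elif now >= crash_at and queue.get(now, timeout) is not None:
            return now  # another instance received A's message
    return None
```

In this model the detection delay is bounded by the visibility timeout: the crash is noticed when the last renewal runs out, not when the crash happens.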

This all seems very convoluted.
Personally, I would go back and look at the original assumption that I need to know when an instance crashes, and consider what I do with that information. I would favor an optimistic solution (i.e., assume success and handle failure) rather than a pessimistic solution (i.e., assume failure and provide some mechanism to ensure success). One problem with the latter is that you are going to have to handle undeclared instance crashes anyway, so why not make that the default behavior? That is, invoke the operation on the instance and handle any failure that occurs.
For example, if I want to invoke an operation on an internal endpoint on another instance I would load balance against all the other instances and, on detecting a failed instance, try the operation on another instance. Ryan Dunn has what is now an ancient post on, among other things, load balancing against internal endpoints.
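A rough sketch of that "try the next instance" approach, in Python for brevity. The endpoint addresses and `fake_operation` are placeholders, and treating only `ConnectionError` as "instance is down" is an assumption; a real client would decide which failures are safe to retry.

```python
def call_with_failover(endpoints, operation):
    """Try `operation` on each endpoint in turn; return the first success."""
    errors = []
    for endpoint in endpoints:
        try:
            return operation(endpoint)
        except ConnectionError as exc:
            # Treat this instance as failed and move on to the next one.
            errors.append((endpoint, exc))
    raise RuntimeError(f"all {len(endpoints)} endpoints failed: {errors}")


def fake_operation(endpoint):
    """Placeholder for a call to an internal endpoint; 10.0.0.1 is 'down'."""
    if endpoint == "10.0.0.1":
        raise ConnectionError("instance is down")
    return f"handled by {endpoint}"
```

Calling `call_with_failover(["10.0.0.1", "10.0.0.2"], fake_operation)` skips the failed instance and returns the result from the second one, without any instance having to be told about the crash.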
My basic point is that it is going to be hard to robustly perform this type of orchestration with messages being passed from one instance to another. There are just too many possible failure points. It would be better to come up with a solution that more directly addresses the underlying need. A simple solution is almost always preferable to a more complex solution.

Related

How to notify all the instances of an Azure Function about an event

We have a Functions App which scales to a couple of hundred instances under peak load. We now need a way to notify all the instances when a particular event happens (maybe a new message in a queue). What are the potential approaches to achieve this? Please advise.

One pattern is to have a common marker (a blob) which changes when the common state is updated. Each instance can proactively check the blob's ETag to determine whether the state has changed, and if so, it knows to reload its state.
Note that:
It's important for the instance to check (rather than wait for a change notification) because there can be a lag in most notification mechanisms. For example, a blob trigger could lag several minutes.
There's no way to bypass the load balancer and send a message to a specific instance, so you can't proactively send an invalidation message.
Another pattern is to have your state fully externalized in something like Redis. That's easy to invalidate and update. (Although that's essentially just a special case of the prior suggestion.)
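The marker-blob pattern can be sketched like this. It is an in-memory Python illustration only: `FakeBlob` stands in for a real blob and its ETag, and `Instance` is a hypothetical per-instance cache, not part of any Functions API.

```python
class FakeBlob:
    """Stand-in for a marker blob whose ETag changes on every write."""

    def __init__(self, content):
        self.content = content
        self.etag = 1

    def write(self, content):
        self.content = content
        self.etag += 1


class Instance:
    """One function instance that caches state and reloads it on ETag change."""

    def __init__(self, blob):
        self._blob = blob
        self._seen_etag = None
        self.state = None
        self.reloads = 0

    def refresh_if_changed(self):
        """Compare ETags and reload only when the marker has changed."""
        if self._blob.etag != self._seen_etag:
            self.state = self._blob.content
            self._seen_etag = self._blob.etag
            self.reloads += 1
            return True
        return False
```

Each instance would call `refresh_if_changed()` at the start of an invocation (or on a timer), so the cost of a check is one cheap metadata comparison rather than a full reload.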

Azure queue - can I verify a message will be read only once?

I am using an Azure queue and have several different processes reading from the queue.
My system is built in a way that assumes each message is read only once.
This Microsoft article claims Azure queues have an at least once delivery guarantee which potentially means two processes can read the same message from the queue.
This StackOverflow thread claims that if I use GetMessage then the message becomes invisible to all other processes for the invisibility timeout.
Assuming I use GetMessage() and never exceed the message invisibility time before I DeleteMessage, can I assume I will get each message only once?

I think there is a property on the queue message named DequeueCount, which is the number of times this message has been dequeued. It's maintained by the queue service. You can use this property to tell whether your message has been read before.
https://learn.microsoft.com/en-us/dotnet/api/azure.storage.queues.models.queuemessage.dequeuecount?view=azure-dotnet

No. The following can happen:
GetMessage()
Add some records in a database...
Generate some files...
DeleteMessage() -> Unexpected failure (process that crashes, instance that reboots, network connectivity issues, ...)
In this case your logic was executed without calling DeleteMessage. This means, once the invisibility timeout expires, the message will appear in the queue and be processed once again. You will need to make sure that your process is idempotent:
Idempotence is the property of certain operations in mathematics and
computer science, that they can be applied multiple times without
changing the result beyond the initial application.
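One common way to get idempotence is to remember which message IDs have already been handled. A minimal Python sketch, with everything kept in memory (`IdempotentHandler` and its fields are made-up names; in a real system the "already processed" record and the side effects would have to be committed together, for example in one database transaction):

```python
class IdempotentHandler:
    """Applies each message at most once, however often it is delivered."""

    def __init__(self):
        self.processed_ids = set()  # IDs of messages already handled
        self.records = []           # stand-in for the database rows

    def handle(self, message_id, payload):
        """Return True if the work was done, False if it was a duplicate."""
        if message_id in self.processed_ids:
            return False  # redelivery after a failed DeleteMessage: skip
        self.records.append(payload)
        self.processed_ids.add(message_id)
        return True
```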
An alternative solution would be to use Service Bus Queues with the ReceiveAndDelete mode (see this page under How to Receive Messages from a Queue). If you receive the message, it will be marked as consumed and never appear again. This way you can be sure it is delivered At-Most-Once (see the comparison with Storage Queues here). But then again, if something happens while you are processing the message (e.g., the server crashes), you could lose valuable information.
Update:
This will simulate an At-Most-Once in storage queues. The message can arrive multiple times via GetMessage, but will only be processed once by your business logic (with the risk that some of your business logic will never execute).
GetMessage()
DeleteMessage()
AddRecordsToDatabase()
GenerateFiles()
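The difference between the two orderings can be shown with a toy model in Python. `deliveries` is a hypothetical helper: it follows one message through two attempts, where the first attempt may crash between its two steps and the second never crashes.

```python
def deliveries(delete_first, crash_between_steps):
    """Count how often the business logic runs for one message."""
    in_queue = True
    runs = 0
    for attempt in (1, 2):
        if not in_queue:
            break  # nothing left to receive
        crash = crash_between_steps and attempt == 1
        if delete_first:
            in_queue = False  # DeleteMessage() comes first
            if crash:
                continue      # crashed before the work ran: message is lost
            runs += 1         # AddRecordsToDatabase(), GenerateFiles()
        else:
            runs += 1         # the work runs first
            if crash:
                continue      # crashed before DeleteMessage(): message returns
            in_queue = False
    return runs
```

With a crash, delete-first gives 0 runs (at most once) and process-first gives 2 runs (at least once); without a crash both give exactly 1.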

Windows Azure - leader instance without single point of failure

I am looking for a way to have a "Singleton" module over multiple worker role instances.
I would like to have a parallel execution model with Queues and multiple worker roles in Azure.
The idea is that I would like to have a "master" instance that, let's say, checks for new data and schedules it by adding it to a queue, processes all messages from a special queue that nobody else processes, and has blob storage mounted as a virtual drive with read/write access.
I will always have only one "master instance". When that master instance goes down for some reason, another of the already running instances should very quickly (within a couple of seconds) be "elected" as the master instance. This should happen before the broken instance is replaced by a new one by the Azure environment (about 15 min).
So it will be some kind of self-organizing, dynamic environment.
I was thinking of having some locking based on storage or table data, with the option to set lock timeouts and some kind of "watchdog" timer, to borrow microprocessor terminology.

There is a general approach to what you seek to achieve.
First, your master instance. You could do your check based on instance ID. It is fairly easy. Use RoleEnvironment.CurrentRoleInstance to get the current instance, then compare its Id property with the Id of the first member of RoleEnvironment.CurrentRoleInstance.Role.Instances when ordered by Id. Something like:
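using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

var instance = RoleEnvironment.CurrentRoleInstance;
// The instance whose Id sorts first among all instances of the role is the master.
if (instance.Id.Equals(instance.Role.Instances.OrderBy(ins => ins.Id).First().Id))
{
    // you are in the single master
}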
Now you need to elect a master upon "healing"/recycling.
You need to hook the RoleEnvironment's Changed event. Check whether the change is a topology change (you don't need the exact change in topology). If it is, elect the next master based on the above algorithm. Check out this great blog post on exactly how to hook the events and detect changes.
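The election rule itself is language independent, so here is a small Python sketch of it. The instance IDs are hypothetical examples, and `on_topology_change` is a made-up name for whatever your Changed handler calls; the only requirement is that every instance sorts the same list the same way.

```python
def elect_master(instance_ids):
    """Every instance applies the same rule: the ID that sorts first wins."""
    return min(instance_ids)


def on_topology_change(my_id, instance_ids):
    """Re-run the election with the current instance list; True if I am master."""
    return elect_master(instance_ids) == my_id
```

Note that this is plain string ordering, so an ID ending in `_IN_10` sorts before one ending in `_IN_2`. That is harmless here because the rule only has to be consistent across instances, not numeric.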
Forgot to add.
If you like locks - blob lease is the best way to acquire / check locks. However working with just the RoleEnvironment events and the simple master election based on Instance ID, I don't think you'll need that complicated locking mechanism. Besides - everything lives in the Queue until it is successfully processed. So if the master dies before it processes something, the "next master" will process it.
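If you do go the lock route, the lease idea boils down to a lock that expires unless its holder keeps renewing it. A minimal in-memory Python sketch (`Lease` is a made-up class, not the blob lease API; with real blob leases the storage service does the expiry bookkeeping for you):

```python
class Lease:
    """In-memory stand-in for an expiring lock such as a blob lease."""

    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, instance_id, now):
        """Acquire or renew the lease; return True if `instance_id` holds it."""
        if self.holder in (None, instance_id) or now >= self.expires_at:
            self.holder = instance_id
            self.expires_at = now + self.duration
            return True
        return False
```

The master calls `try_acquire` periodically to renew. If it dies, the other instances keep failing until the lease expires, and then the first one to ask becomes the new master, so the failover time is bounded by the lease duration.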

Distributed pub/sub with single consumer per message type

I have no clue if it's better to ask this here, or over on Programmers.SE, so if I have this wrong, please migrate.
First, a bit about what I'm trying to implement. I have a node.js application that takes messages from one source (a socket.io client), and then does processing on the message, which might result in zero or more messages back out, either to the sender, or other clients within that group.
For the processing, I would like to essentially just shove the message into a queue, then it works its way through various message processors that might kick off their own items, and eventually, the bit running socket.io is informed "Hey, send this message back"
As a concrete example, say a user signs into the service. That sign-in message is then placed in the queue, where the authorization processor gets it, does its thing, then places a message back in the queue saying the client's been authorized. This goes back to the socket.io socket that is connected to the client, along with other clients that might be interested. It can also go to other subsystems that might want to do more processing on authorization (looking up user info, sending more info to the client based on their data, etc).
If I wanted strong coupling, this would be easy, but I tried that before, and it just turns into a mess of spaghetti code that's very fragile, and I would like to avoid that. Another wrench in the setup is that this should be cluster-able, which is where the real problem comes in. There might be more than one, say, authorization processor running. But the authorization message should be processed only once.
So, in short, I'm looking for a pattern/technique that will allow me to, essentially, have multiple "groups" of subscribers for a message, and the message will be processed only once per group.
I thought about maybe having each instance of a processor generate a unique name that would be used as a list in Redis. This name would then be registered with some sort of dispatch handler, and placed into a set for that group of subscribers. Then when a message arrives, the dispatcher pulls a random member out of that set, and places the message into that member's list. While it seems like this would work, it seems somewhat over-complicated and fragile.
The core problem is I've never designed a system like this, so I'm not even sure the proper terms to use or look up. If anyone can point me in the right direction for this, I would be most appreciative.

I think what you're describing is similar to the https://www.getbridge.com/ service. I tried it but ended up writing my own based on zeromq. It allows you to register services, req -> <- rec, and channels, which are pub/sub workers.
As for the design, I used a client -> broker -> services & channels layout, all plug and play using auto discovery. The services register their schema with the brokers, which open a TCP connection so that brokers on other servers can communicate with that broker group's services. Internal services and clients then connect via Unix sockets or IPC channels, whichever is preferred.

I ended up wrapping around the redis publish/subscribe functions a bit to do this. Each type of message processor gets a "group name", and there can be multiple instances of the processor within that group (so multiple instances of the program can run for clustering).
When publishing a message, I generate an incremental ID, then store the message in a string key with that ID, then publish the message ID.
On the receiving end, the first thing the subscriber does is attempt to add the message ID it just got from the publisher into a set of received messages for that group with sadd. If sadd returns 0, the message has already been grabbed by another instance, and it just returns. If it returns 1, the full message is pulled out of the string key and sent to the listener.
Of course, this relies on redis being single threaded, which I imagine will continue to be the case.
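For anyone who wants to see the claim-by-sadd logic in isolation, here is a sketch in Python with an in-memory stand-in for Redis. `FakeRedis` only mimics the return value of SADD for a single member, and the key names `message:<id>` and `received:<group>` are made up for this example.

```python
class FakeRedis:
    """Minimal in-memory stand-in for the few Redis commands used here."""

    def __init__(self):
        self.strings = {}
        self.sets = {}
        self.counter = 0

    def incr(self):
        """Hand out the next incremental message ID."""
        self.counter += 1
        return self.counter

    def sadd(self, key, member):
        """Like SADD with one member: 1 if newly added, 0 if already present."""
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def publish(redis, message):
    """Store the message under a fresh ID and return the ID to be published."""
    message_id = redis.incr()
    redis.strings[f"message:{message_id}"] = message
    return message_id


def on_message(redis, group, message_id):
    """Return the message if this subscriber claimed it, otherwise None."""
    if redis.sadd(f"received:{group}", message_id) == 0:
        return None  # another instance in this group already took it
    return redis.strings[f"message:{message_id}"]
```

Two subscribers in the same group calling `on_message` with the same ID get the message once and `None` once, while a subscriber in a different group still gets its own copy.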

What you might be looking for is an AMQP implementation, where you can bind queues to custom exchanges and implement a pub-sub model.
RabbitMQ is a popular AMQP implementation with lots of client libraries.
It also has a node.js library.

Controlling azure worker roles concurrency in multiple instance

I have a simple worker role in Azure that does some data processing on a SQL Azure database.
The worker basically adds data from a 3rd party datasource to my database every 2 minutes. When I have two instances of the role, this obviously doubles up unnecessarily. I would like to have 2 instances for redundancy and the 99.95% uptime, but do not want them both processing at the same time as they will just duplicate the same job. Is there a standard pattern for this that I am missing?
I know I could set flags in the database, but am hoping there is another easier or better way to manage this.
Thanks

As Mark suggested, you can use an Azure queue to post a message. You can have the worker role instance post a followup message to the queue as the last thing it does when processing the current message. That should deal with the issue Mark brought up regarding the need for a semaphore. In your queue message, you can embed a timestamp marking when the message can be processed. When creating a new message, just add two minutes to current time.
And... in case it's not obvious: in the event the worker role instance crashes before completing processing and fails to repost a new queue message, that's fine. In this case, the current queue message will simply reappear on the queue and another instance is then free to process it.
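A sketch of the timestamp logic, in Python and without any queue client. The message is modelled as a plain dict with a hypothetical `not_before` field, and posting or deleting messages is left to the caller.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=2)


def is_due(message, now):
    """A message may be processed once its embedded timestamp has passed."""
    return now >= message["not_before"]


def follow_up(now):
    """Build the next token message, due two minutes from now."""
    return {"not_before": now + INTERVAL}


def handle(message, now, do_work):
    """Run the job if the message is due; return the follow-up message to post."""
    if not is_due(message, now):
        return None  # too early: leave the message on the queue
    do_work()
    return follow_up(now)  # post this as the last step, then delete the old one
```

Only the instance that currently holds the token message does any work, so a second instance adds redundancy without duplicating the job.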

There is not a super easy way to do this, I don't think.
You can use a semaphore, as Mark has mentioned, to basically record the start and the stop of processing. Then you can have any number of instances running, each inspecting the semaphore record and only acting if the semaphore allows it.
However, the caveat here is: what happens if one of the instances crashes in the middle of processing and never releases the semaphore? You can implement a "timeout" value after which other instances will attempt to kick-start processing if there hasn't been an unlock for X amount of time.
Alternatively, you can use a third-party monitoring service like AzureWatch to watch for unresponsive instances in Azure and start a new instance if the number of "Ready" instances is under 1. This can save you some money by not having to keep 2 instances up and running all the time, but there is a slight lag between when an instance fails and when a new one is started.

A semaphore as suggested would be the way to go, although I'd probably go with a simple timestamp heartbeat in blob store.
The other thought is, how necessary is it? If your loads can sustain being down for a few minutes, maybe just let the role recycle?
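The heartbeat check could look roughly like this (Python sketch; the record layout with `owner` and `beat` and the 240-second threshold are assumptions made for the example, not anything the blob store defines):

```python
def should_run(record, my_id, now, stale_after=240):
    """Decide whether this instance should do the work on this tick.

    `record` is the last heartbeat read from shared storage: a dict with
    'owner' (instance ID) and 'beat' (seconds), or None if none exists yet.
    """
    if record is None:
        return True  # nobody has claimed the job yet
    if record["owner"] == my_id:
        return True  # I am the active instance; keep going
    # Another instance owns the job: take over only if it has gone quiet.
    return now - record["beat"] > stale_after
```

Two instances could still read a stale record at the same moment, so the write that claims ownership should be conditional (for example, on the blob's ETag) so that only one of them wins.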

Small catch on David's solution. Re-posting the message to the queue would happen as the last step of the current execution, so that if the machine crashes along the way, the current message would expire and resurface on the queue. That assumes that the message was originally peeked and requires a de-queue operation to remove it from the queue. The de-queue must happen before inserting the new message into the queue. If the role crashes in between these 2 operations, then there will be no tokens left in the system and processing will come to a halt.
The ESB dup check sounds like a feasible approach, but it does not sound like it would be deterministic either since the bus can only check for identical messages currently existing in a queue. But if one of the messages comes in right after the previous one was de-queued, there is a chance to end up with 2 processes running in parallel.
An alternative solution, if you can afford it, would be to never de-queue and just lease the message via Peek operations. You would have to ensure that the invisibility timeout never goes beyond the processing time in your worker role. As far as creating the token in the first place, the same worker role startup strategy described before combined with ASB dup check should work (since messages would never move from the queue).
