Azure table storage - Distributed locking

I am storing event data in table storage. There are multiple instances of a worker role that need to access this. Each worker role instance needs to access a unique row in this table, do some processing with the data, and, if it succeeds, mark the row as completed so that no other instance picks it up. While it is being processed, the row needs to be invisible to other workers so that they don't process it as well.
Is there a design that can solve this problem?

Azure Tables as such doesn't have a locking mechanism; that is only available for blobs and queues.
One possible way for you to solve this problem is to use the Master/Slave pattern. So let's assume that you have 5 worker role instances running. Periodically (say every 30 seconds), all of these instances will try to acquire a lease on a blob. Only one instance will succeed, and that instance becomes the master (all other instances become slaves).
Now what the master will do is fetch the data from the table (say 5 records) and insert each record into a queue as a separate message. Once the master has done that, it goes back to being a slave. What the slaves do is fetch one message from the queue (dequeuing it so that other instances can't see it), process it, and then update the record in the table. Once a slave has done its job, it goes back to sleep, only to wake up after the predetermined interval.
Please see the Competing Consumers pattern for more details.
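To make the lease step concrete, here is a minimal sketch using the Azure Storage SDK for Python; the container, blob, and queue names, the 30-second lease, and the fetch_pending_rows callback are all assumptions for illustration:
```python
# Sketch: elect a "master" instance by acquiring a blob lease, then fan the work
# out to a queue so every instance can act as a competing consumer.
from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobClient
from azure.storage.queue import QueueClient

conn = "<storage-connection-string>"
# The lock blob must already exist; create it once as an empty blob named "master-lock".
lock_blob = BlobClient.from_connection_string(conn, "locks", "master-lock")
queue = QueueClient.from_connection_string(conn, "work-items")

def try_become_master(fetch_pending_rows):
    try:
        lease = lock_blob.acquire_lease(lease_duration=30)   # only one instance succeeds
    except HttpResponseError:
        return False                                         # another instance is master this round
    try:
        for row in fetch_pending_rows():                     # e.g. unprocessed rows from Table storage
            queue.send_message(row["RowKey"])                # one queue message per record
    finally:
        lease.release()                                      # go back to being a "slave"
    return True
```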

Use Azure Queues and a producer/consumer pattern: write a unit of work as a message to the queue on the producer side and let your worker roles consume the work from the queue and process it. The queue handles making a message invisible while it is being processed to avoid duplication; each worker role can then remove the message from the queue after successfully processing it.
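A rough sketch of that producer/consumer flow with the Azure Storage queue SDK for Python (the queue name, timeout, and the do_work function are placeholders); the key point is that the message stays invisible while a worker holds it and is only deleted after the work succeeds:
```python
# Sketch of the producer/consumer pattern on an Azure Storage queue.
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<storage-connection-string>", "work-items")

# Producer side: one unit of work per message.
queue.send_message("process-event-42")

# Consumer side (inside the worker role loop): the message is invisible to other
# workers for `visibility_timeout` seconds while this worker processes it.
for msg in queue.receive_messages(messages_per_page=1, visibility_timeout=300):
    try:
        do_work(msg.content)          # your processing logic (placeholder)
        queue.delete_message(msg)     # only delete after the work succeeded
    except Exception:
        pass                          # leave it; it becomes visible again after the timeout
```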

Related

Event Hub -- how to prevent duplicate handling when consumers scale out

When we have multiple consumers of Event Hub (or any messaging service, for that matter), how do we make sure that no message is processed twice, especially when a consumer auto-scales out to multiple instances?
I know we could keep track of the last message processed, but then again, between checking whether a message was processed and actually processing it, another instance could already have processed it (a race condition?).
So, how do we solve that in a scalable way?
[UPDATE]
I am aware there is a recommendation to have at least as many partitions as there are consumers, but what to do when a single consumer cannot process the messages directed to it on its own and needs to scale out to multiple instances?
Each processor takes a lease on a partition; see the docs:
An event processor instance typically owns and processes events from one or more partitions. Ownership of partitions is evenly distributed among all the active event processor instances associated with an event hub and consumer group combination.
So scaling out doesn't result in duplicate message processing because a new processor cannot take a lease on a partition that is already being handled by another processor.
Then, regarding your comment:
I am aware there is a recommendation to have at least as many partitions as there are consumers
It is the other way around: it is recommended to have as many consumers as you have partitions. If you have more consumers than partitions the consumers will compete with each other to obtain a lock on a partition.
Now, regarding duplicate messages: since Event Hub guarantees at-least-once delivery, there isn't much you can do to prevent this. There aren't that many scalable services that offer at-most-once delivery; I know that Azure Service Bus queues do offer it if you really need it.
The question may arise what can cause duplicate message processing. Well, when processing messages the processor does some checkpointing: once in a while it will store its position within a partition's event sequence (remember, a partition is bound to a single processor). Now, when the processor instance crashes between two checkpoints, a new instance will resume processing messages from the position of the last checkpoint. That may very well lead to older messages being processed again.
If a reader disconnects from a partition, when it reconnects it begins reading at the checkpoint that was previously submitted by the last reader of that partition in that consumer group.
So, that means you need to make sure your processing logic is idempotent. How you do that is up to you, as I don't know your use case.
One option is to track each individual message to see whether it has already been processed or not. If you do not have a unique ID to check on, maybe you can generate a hash of the whole message and compare against that.
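As an illustration of that check, a minimal sketch that records a hash of each processed message in an Azure Table (the table name and partition key are assumptions); inserting the entity fails if the hash was already recorded, so the handler runs at most once per message:
```python
# Sketch: idempotent processing by recording a hash/ID of every processed message.
import hashlib
from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<storage-connection-string>", table_name="ProcessedMessages")

def process_once(message_body: bytes, handler):
    msg_id = hashlib.sha256(message_body).hexdigest()    # or use the message's own unique ID
    try:
        # The insert fails if this ID was already recorded, so duplicates are skipped.
        table.create_entity({"PartitionKey": "dedupe", "RowKey": msg_id})
    except ResourceExistsError:
        return False                                      # already processed, skip
    handler(message_body)
    return True
```
Note the trade-off: recording the ID before running the handler means a handler crash can leave a message marked as processed, while recording it afterwards instead leaves a small window of possible duplicate processing.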

Understanding Azure Event Hubs partitioned consumer pattern

Azure Event Hub uses the partitioned consumer pattern described in the docs.
I have some problems understanding the consumer side of this model when it comes to a real world scenario.
So let's say I have 1000 messages sent to an event hub with 4 partitions, without specifying any partition ID. This means the messages will be spread over all partitions using round-robin.
Now I want to have two applications distributing the messages to two different databases. My questions there are:
Let's say for the first application I want to store all messages in Database 1. This means that, for maximum speed, in my consumer application I need to have 4 threads (consumers), each listening to one partition of the event hub, right? Each of them also has to store its own offset for the partition it is reading (checkpoint).
Let's say my second application wants to filter the messages and only store a subset of them in Database 2. There I also need 4 consumers, since I don't know which message goes to which partition, right?
Also, for the two applications I need to have two consumer groups, but why? Is the filtering of the messages defined in the consumer group? I don't really get why I need this, since the applications' consumers store the partition checkpoints themselves and I can do the filtering within the applications.
I know there is the EventProcessorHost class, but I want to understand the concepts of the Event Hub at a lower level.
Let's say for the first application I want to store all messages in Database 1. This means that, for maximum speed, in my consumer application I need to have 4 threads (consumers), each listening to one partition of the event hub, right? Each of them also has to store its own offset for the partition it is reading (checkpoint).
Correct, you should have a process per provisioned partition. So, if you have 4 partitions you should have 4 processes, each processing the messages of a specific partition. If you process the messages using an EventProcessorHost, it will take care of spinning up the processes for you.
Let's say my second application wants to filter the messages and only store a subset of them in Database 2. There I also need 4 consumers, since I don't know which message goes to which partition, right?
What do you mean by a consumer? You need another 4 processes to process the messages, but they should be configured to read using a different consumer group. Otherwise they will compete with the processes of application 1.
Also, for the two applications I need to have two consumer groups, but why? Is the filtering of the messages defined in the consumer group? I don't really get why I need this, since the applications' consumers store the partition checkpoints themselves and I can do the filtering within the applications.
Let us define a consumer group:
Consumer groups enable multiple consuming applications to each have a separate view of the incoming message stream, and to read the stream independently at its own pace with its own offset
So yes, you need 2 different consumer groups.
Each consumer group will get all messages sent to the event hub partitions. Each consumer group tracks its own progress in the stream of messages. That is why you need two for your scenario.
Say you define an additional consumer group called "App2-Consumer-Group"; its reader processes will receive all messages but should take no action for the messages they are not interested in.
If you did not create an additional consumer group, the reader processes for the default consumer group would process the messages for the first application and mark them as processed using the checkpointing mechanism. The reader processes for the second application would not get those messages, since they are already marked as processed. (In real life, when using one consumer group, some messages might be picked up by the reader processes for the first application and some by the reader processes for the second application, as the processes compete to get a lock on a specific partition.)
This is how consumer groups track their own progress in the stream of messages, and hence why you need two of them when you have two different sets of processing logic for the two applications.
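To make that concrete, here is a sketch of the second application with the current Event Hubs SDK for Python (EventHubConsumerClient plus a blob checkpoint store is the successor to EventProcessorHost); the consumer group, event hub and container names, and the filter/store functions are assumptions:
```python
# Sketch: two applications reading the same event hub independently by using
# two consumer groups. Each consumer group checkpoints its own progress per partition.
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

checkpoints = BlobCheckpointStore.from_connection_string("<storage-conn-str>", "app2-checkpoints")

client = EventHubConsumerClient.from_connection_string(
    "<eventhub-conn-str>",
    consumer_group="App2-Consumer-Group",    # App 1 would use another group, e.g. "$Default"
    eventhub_name="events",
    checkpoint_store=checkpoints,
)

def on_event(partition_context, event):
    if is_interesting(event):                     # App 2 filters in its own code (placeholder)
        store_in_database_2(event)                # placeholder
    partition_context.update_checkpoint(event)    # progress is tracked per consumer group

with client:
    client.receive(on_event=on_event, starting_position="-1")   # "-1" = from the beginning
```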

Failure handling for Queue Centric work pattern

I am planning to use a queue-centric design as described here for one of my applications. It essentially consists of using an Azure queue where work requests are queued from the UI. A worker reads a message from the queue, processes it, and deletes it from the queue.
The 'work' done by the worker is within a transaction, so if the worker fails before completing it, upon restart it picks up the same message again (as it has not been deleted from the queue) and tries to perform the operation again (up to a maximum number of retries).
To scale I could use two methods:
Multiple workers, each with a separate queue. So if I have five workers W1 to W5, I have 5 queues Q1 to Q5, each worker knows which queue to read from, and failure handling is similar to the case with one queue and one worker.
One queue and multiple workers. Here failure/retry handling would be more involved and might end up using the 'invisibility' time in the message queue to make sure no two workers pick up the same job. The invisibility time would have to be calculated to make sure that it's long enough for the job to complete, yet not so long that retries happen only after a long delay.
I would like to know if the 1st approach is the correct way to go. What are robust ways of handling failures in the second approach above?
You would be better off taking approach 2 - a single queue, but with multiple workers.
This is better because:
The process that delivers messages to the queue only needs to know about a single queue endpoint. This reduces complexity at this end;
Scaling the number of workers that are pulling from the queue is now decoupled from any code / configuration changes - you can scale up and down much more easily (and at runtime)
If you are worried about the visibility timeout, you can initially choose a default timespan, and then, if the worker looks like it's taking too long, it can periodically call UpdateMessage() to extend the visibility of the message.
Finally, if your worker times out and fails to complete processing of the message, it'll be picked up again by some other worker to try again. You can also use the DequeueCount property of the message to manage the number of retries.
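For illustration, a sketch of both mechanisms with the Azure Storage queue SDK for Python (the retry limit, timeouts, and the work/poison functions are assumptions):
```python
# Sketch: single queue, many workers. Extend the message's invisibility while working,
# and give up after too many delivery attempts.
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<storage-connection-string>", "jobs")
MAX_RETRIES = 5   # assumption; tune for your workload

for msg in queue.receive_messages(visibility_timeout=60):
    if msg.dequeue_count > MAX_RETRIES:
        handle_poison(msg)                 # e.g. copy to a "jobs-poison" queue (placeholder)
        queue.delete_message(msg)
        continue
    for step in long_running_steps(msg.content):          # placeholder for your work
        # Keep the message invisible to other workers while we are still busy.
        msg = queue.update_message(msg, visibility_timeout=60)
    queue.delete_message(msg)              # done; remove it so nobody retries
```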
Multiple workers, each with a separate queue. So if I have five workers W1 to W5, I have 5 queues Q1 to Q5, each worker knows which queue to read from, and failure handling is similar to the case with one queue and one worker.
With this approach I see the following issues:
This approach makes your architecture tightly coupled (thus defeating the whole purpose of using queues). Because each worker role listens to a dedicated queue, the web application responsible for pushing messages into the queues always needs to know how many workers are running. Any time you scale your worker role up or down, you somehow need to tell the web application so that it can start pushing messages into the appropriate queues.
If a worker role instance is taken down for whatever reason, there's a possibility that some messages may never be processed, as the other worker role instances are working on their own dedicated queues.
There may be a possibility of under- or over-utilization of worker role instances depending on how the web application pushes messages into the queues. For optimal utilization, the web application would need to know about worker role utilization so that it can decide which queue to send a message to. This is certainly not a desirable thing for a web application to do.
I believe #2 is the correct way to go. @Brendan Green has covered your concerns about #2 excellently in his answer.

What assumptions can I make about global time on Azure?

I want my Azure role to reprocess data in case of sudden failures. I consider the following option.
For every block of data to process I have a database table row, and I could add a column meaning "time of last ping from a processing node". So when a node grabs a data block for processing, it sets the state to "processing" and that time to the current time, and then it's the node's responsibility to update that time, say, every minute. Periodically some node will then ask for "all blocks that are in the processing state and whose ping time is older than ten minutes", consider those blocks abandoned, and somehow queue them for reprocessing.
I have one very serious concern. The above approach requires that nodes have more or less the same time. Can I rely on all Azure nodes having the same time with some reasonable precision (say several seconds)?
For processing times under 2 hrs, you can usually rely on queue semantics (visibility timeout). If you have the data stored in blob storage, you can have a worker pop a queue message containing the name of the blob to work on and set a reasonable visibility timeout on the message (up to 2 hrs today). Once it completes the work, it can delete the queue message. If it fails, the delete is never called and after the visibility timeout, it will reappear on the queue for reprocessing. This is why you want your work to be idempotent, btw.
For processing that lasts longer than two hours, I generally recommend a leasing strategy where the worker leases the underlying blob data (if possible, or a dummy blob otherwise) using the intrinsic lease functionality in Windows Azure blob storage. When a worker goes to retrieve a file, it tries to lease it. A file that is already leased indicates a worker role is currently processing it. If a failure occurs, the lease is broken and the file becomes leasable by another instance. Leases must be renewed every minute or so, but they can be held indefinitely.
Of course, you are keeping the data to be processed in blob storage, right? :)
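For the longer-running case, here is a minimal sketch of that lease-and-renew approach with the Azure Storage blob SDK for Python (the container/blob names, the 60-second lease, the 45-second renewal interval, and process_blob are assumptions):
```python
# Sketch: a worker "owns" a blob by holding a lease on it and renewing it while it works.
# If the worker dies, the lease expires and another instance can take the blob over.
import threading
from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string("<storage-connection-string>", "data", "block-0001.json")

try:
    lease = blob.acquire_lease(lease_duration=60)   # fails if another worker holds it
except HttpResponseError:
    lease = None                                    # someone else is processing this blob

if lease:
    stop = threading.Event()

    def keep_alive():
        while not stop.wait(45):                    # renew well before the 60s lease runs out
            lease.renew()

    threading.Thread(target=keep_alive, daemon=True).start()
    try:
        process_blob(blob.download_blob().readall())    # placeholder for the long-running work
    finally:
        stop.set()
        lease.release()
```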
As already indicated, you should not rely on synchronized times between VM nodes. If you store datetimes for any reason - use UTC or you will be sorry later.
The answer here isn't to use time-based synchronization (if you do, however, make sure you use DateTime.UtcNow); there is no guarantee anywhere that the clocks are synced, nor should there be.
For the problem you are describing, a queue-based system is the answer. I've referenced it a lot and will do so again: I've explained some benefits of queue-based systems in my blog post.
The idea is the following:
You put a work item on the queue
Your worker role (one or many of them) peeks & locks the message
You try to process the message; if you succeed, you remove the message from the queue, and if not, you leave it where it is
With your approach I would use AppFabric Queues because you can also have topics & subscriptions, which allow you to monitor the data items. The example in my blog post covers this exact scenario, with the only difference being that instead of having a worker role I poll the queue from my web application. But the concept is the same.
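AppFabric Service Bus queues live on today as Azure Service Bus. A sketch of the peek-lock flow above with the Service Bus SDK for Python (the queue name, connection string, and the handle function are placeholders):
```python
# Sketch: Service Bus queue with peek-lock semantics - the message is locked while
# we process it, completed on success, and abandoned (left on the queue) on failure.
from azure.servicebus import ServiceBusClient, ServiceBusMessage

client = ServiceBusClient.from_connection_string("<servicebus-connection-string>")

# Producer: put a work item on the queue.
with client.get_queue_sender("work-items") as sender:
    sender.send_messages(ServiceBusMessage("process-block-42"))

# Consumer: peek-lock, process, then complete or abandon.
with client.get_queue_receiver("work-items") as receiver:
    for msg in receiver.receive_messages(max_message_count=1, max_wait_time=30):
        try:
            handle(str(msg))                 # placeholder for the actual processing
            receiver.complete_message(msg)   # removes it from the queue
        except Exception:
            receiver.abandon_message(msg)    # lock released; message stays on the queue
```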
I would try this a different way, using queue storage. Put your block of data on a queue with a timeout, then have your processing nodes (worker roles?) pull this data off the queue.
After the data is pulled off the queue, if the processing node does not delete the entry from the queue, it will reappear on the queue for processing after the timeout period.
Remote desktop into a role instance and check (a) the time zone (UTC, I think), and (b) that Internet Time is enabled in Date and Time settings. If so then you can rely on them being no more than a few ms apart. (This is not to say that the suggestions to use a message queue instead won't work, but perhaps they do not suit your needs.)

Controlling azure worker roles concurrency in multiple instance

I have a simple worker role in Azure that does some data processing on an SQL Azure database.
The worker basically adds data from a 3rd-party data source to my database every 2 minutes. When I have two instances of the role, this obviously doubles up unnecessarily. I would like to have 2 instances for redundancy and the 99.95% uptime, but I do not want them both processing at the same time, as they will just duplicate the same job. Is there a standard pattern for this that I am missing?
I know I could set flags in the database, but am hoping there is another easier or better way to manage this.
Thanks
As Mark suggested, you can use an Azure queue to post a message. You can have the worker role instance post a follow-up message to the queue as the last thing it does when processing the current message. That should deal with the issue Mark brought up regarding the need for a semaphore. In your queue message, you can embed a timestamp marking when the message can be processed. When creating a new message, just add two minutes to the current time.
And... in case it's not obvious: in the event the worker role instance crashes before completing processing and fails to repost a new queue message, that's fine. In this case, the current queue message will simply reappear on the queue and another instance is then free to process it.
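A sketch of that approach with the Azure Storage queue SDK for Python, using the queue's visibility delay as the two-minute timer (the queue name and the import function are placeholders):
```python
# Sketch: a single "tick" message circulates on the queue. Whichever instance
# receives it runs the job, posts the next tick delayed by two minutes, and
# removes the current message. (The relative order of those last two calls has
# trade-offs; see the catch discussed in a later answer below.)
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<storage-connection-string>", "scheduler")

while True:
    for msg in queue.receive_messages(visibility_timeout=600):   # hidden while we work
        import_third_party_data()                                # placeholder for the actual job
        queue.send_message("tick", visibility_timeout=120)       # next run becomes visible in 2 minutes
        queue.delete_message(msg)
```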
There is not a super easy way to do this, I don't think.
You can use a semaphore as Mark has mentioned, basically to record the start and the stop of processing. Then you can have any number of instances running, each inspecting the semaphore record and only acting if the semaphore allows it.
However, the caveat here is: what happens if one of the instances crashes in the middle of processing and never releases the semaphore? You can implement a "timeout" value after which other instances will attempt to kick-start processing if there hasn't been an unlock for X amount of time.
Alternatively, you can use a third-party monitoring service like AzureWatch to watch for unresponsive instances in Azure and start a new instance if the number of "Ready" instances drops below 1. This can save you some money by not having to have 2 instances up and running all the time, but there is a slight lag between when an instance fails and when a new one is started.
A semaphore as suggested would be the way to go, although I'd probably go with a simple timestamp heartbeat in blob storage.
The other thought is, how necessary is it? If your loads can sustain being down for a few minutes, maybe just let the role recycle?
Small catch with David's solution. Re-posting the message to the queue would happen as the last thing in the current execution, so that if the machine crashes along the way, the current message expires and re-surfaces on the queue. That assumes the message was originally peeked, and a de-queue operation is required to remove it from the queue. The de-queue must happen before inserting the new message into the queue. If the role crashes between these 2 operations, there will be no tokens left in the system and it will come to a halt.
The ASB duplicate-detection check sounds like a feasible approach, but it does not sound like it would be deterministic either, since the bus can only check for identical messages currently existing in a queue. If one of the messages comes in right after the previous one was de-queued, there is a chance of ending up with 2 processes running in parallel.
An alternative solution, if you can afford it, would be to never de-queue and just lease the message via Peek operations. You would have to ensure that the invisibility timeout never goes beyond the processing time in your worker role. As far as creating the token in the first place goes, the same worker role startup strategy described before, combined with the ASB dup check, should work (since messages would never move from the queue).
