What assumptions can I make about global time on Azure? - azure

I want my Azure role to reprocess data in case of sudden failures. I consider the following option.
For every block of data to process I have a database table row and I could add a column meaning "time of last ping from a processing node". So when a node grabs a data block for processing it sets "processing" state and that time to "current time" and then it's the node responsibility to update that time say every one minute. Then periodically some node will ask for "all blocks that have processing state and ping time larger than ten minutes" and consider those blocks as abandoned and somehow queue them for reprocessing.
I have one very serious concern. The above approach requires that nodes have more or less the same time. Can I rely on all Azure nodes having the same time with some reasonable precision (say several seconds)?

For processing times under 2 hrs, you can usually rely on queue semantics (visibility timeout). If you have the data stored in blob storage, you can have a worker pop a queue message containing the name of the blob to work on and set a reasonable visibility timeout on the message (up to 2 hrs today). Once it completes the work, it can delete the queue message. If it fails, the delete is never called and after the visibility timeout, it will reappear on the queue for reprocessing. This is why you want your work to be idempotent, btw.
For processing that lasts longer than two hours, I generally recommend a leasing strategy where the worker leases the underlying blob data (if possible or a dummy blob otherwise) using the intrisic lease functionality in Windows Azure blob storage. When a worker goes to retrieve a file, it tries to lease it. A file that is already leased is indicative of a worker role currently processing it. If failure occurs, the lease will be broken and it will become leasable by another instance. Leases must be renewed every min or so, but they can be held indefinitely.
Of course, you are keeping the data to be processed in blob storage, right? :)
As already indicated, you should not rely on synchronized times between VM nodes. If you store datetimes for any reason - use UTC or you will be sorry later.

The answer here isn't to use time based synchronization (if you would however, make sure you use UTCNow), but there is still no guarantee anywhere that the clocks are synced. Nor should there be.
For the problem you are describing a queue based system is the answer. I've been referencing a lot to it, and will do it again, but I've explained some benefits of queue based systems in my blog post.
The idea is the following:
You put a work item to the queue
Your worker role (one or many of them) peeks & locks the message
You try to process the message, if you succeed, you remove the message from the queue,
if not, you let it stay where it is
With your approach I would use AppFabric Queues because you can also have topics & subscriptions, which allows you to monitor the data items. The example in my blog post coveres this exact scenario, with the only difference being that instead of having a worker role I poll the queue from my web application. But the concept is the same.

I would try this a different way using queue storage. If you pop your block of data on a queue with a timeout then have your processing nodes (worker roles?) pull this data off the queue.
After the data is popped off the queue if the processing node does not delete the entry from the queue it will reappear on the queue for processing after the timeout period.

Remote desktop into a role instance and check (a) the time zone (UTC, I think), and (b) that Internet Time is enabled in Date and Time settings. If so then you can rely on them being no more than a few ms apart. (This is not to say that the suggestions to use a message queue instead won't work, but perhaps they do not suit your needs.)

Related

Azure table storage - Distributed locking

I am storing event data in table storage. There are multiple instances of a worker role that need to access this. Each worker role instance needs to access a unique row in this table and do some processing with this data, and if it succeeds, needs to mark this data as completed so that any other instance doesn't pick this up. While processing, this row needs to be invisible to other workers so that they dont process this as well.
Is there a design that can solve this problem?
As such Azure Tables doesn't have a locking mechanism. It is available for blobs and queues.
One possible way for you to solve this problem is to use Master/Slave Pattern. So let's assume that you have 5 worker role instances running. Periodically (say every 30 seconds), all of these instances will try to acquire lease on a blob. Only one instance will be able to succeed and that instance will become the master (all other instances will become slaves).
Now what the mater will do is fetch the data from table (say 5 records) and inserts them in a queue as separate messages. Once the master does that, it automatically becomes the slave. What slaves would do is fetch one message from the queue (by dequeuing the message so that other instances can't see that message), process it and then update the record in the table. Once the slave has done its job, it will go back to sleep only to wake up after that predetermined time.
Please see Competing Consumer Patterns for more details.
Use Azure Queues and a producer consumer pattern, write Unit of Work as a message to the queue on the producer side and let your worker roles consume the work from the queue and process it. Queue would handle making that message invisible while it is being processed to avoid duplication, each worker role can then remove the message from the queue after successfully processing it.

Azure Storage Performance Queue vs Table

I've got a nice logging system I've set up that writes to Azure Table Storage and it has worked well for a long time. However, there are certain places in my code where I need to now write a lot of messages to the log (50-60 msgs) instead of just a couple. It is also important enough that I can't start a new thread to finish writing to the log and return the MVC action before I know the log is successful because theoretically that thread could die. I have to write to the log before I return data to the web user.
According to the Azure dashboard, Table Storage transactions take ~37ms to commit, end to end (E2E), while queues only take ~6ms E2E to commit.
I'm now considering not logging directly to table storage, and instead log to an Azure Queue, then have a batch job run that reads off the queue and then puts them in their proper place in table storage. That way I can still index them properly via their partition and row keys. I can also write just a single queue message with all of the log entries. So it should only take 6ms instead of (37 * 50) ms.
I know that there are Table Storage batch operations. However, each of the log entries typically goes to different partition, and batch ops need to stay within a single partition.
I know that queue messages only live for 7 days, so I'll make sure I store queue messages in a new mechanism if they're older than a day (if it doesn't work the first 50 times, it just isn't going to work).
My question, then is: what am I not thinking about? How could this completely kick me in the balls in 4 months down the road?

Requeue or delete messages in Azure Storage Queues via WebJobs

I was hoping if someone can clarify a few things regarding Azure Storage Queues and their interaction with WebJobs:
To perform recurring background tasks (i.e. add to queue once, then repeat at set intervals), is there a way to update the same message delivered in the QueueTrigger function so that its lease (visibility) can be extended as a way to requeue and avoid expiry?
With the above-mentioned pattern for recurring background jobs, I'm also trying to figure out a way to delete/expire a job 'on demand'. Since this doesn't seem possible outside the context of WebJobs, I was thinking of maybe storing the messageId and popReceipt for the message(s) to be deleted in Table storage as persistent cache, and then upon delivery of message in the QueueTrigger function do a Table lookup to perform a DeleteMessage, so that the message is not repeated any more.
Any suggestions or tips are appreciated. Cheers :)
Azure Storage Queues are used to store messages that may be consumed by your Azure Webjob, WorkerRole, etc. The Azure Webjobs SDK provides an easy way to interact with Azure Storage (that includes Queues, Table Storage, Blobs, and Service Bus). That being said, you can also have an Azure Webjob that does not use the Webjobs SDK and does not interact with Azure Storage. In fact, I do run a Webjob that interacts with a SQL Azure database.
I'll briefly explain how the Webjobs SDK interact with Azure Queues. Once a message arrives to a queue (or is made 'visible', more on this later) the function in the Webjob is triggered (assuming you're running in continuous mode). If that function returns with no error, the message is deleted. If something goes wrong, the message goes back to the queue to be processed again. You can handle the failed message accordingly. Here is an example on how to do this.
The SDK will call a function up to 5 times to process a queue message. If the fifth try fails, the message is moved to a poison queue. The maximum number of retries is configurable.
Regarding visibility, when you add a message to the queue, there is a visibility timeout property. By default is zero. Therefore, if you want to process a message in the future you can do it (up to 7 days in the future) by setting this property to a desired value.
Optional. If specified, the request must be made using an x-ms-version of 2011-08-18 or newer. If not specified, the default value is 0. Specifies the new visibility timeout value, in seconds, relative to server time. The new value must be larger than or equal to 0, and cannot be larger than 7 days. The visibility timeout of a message cannot be set to a value later than the expiry time. visibilitytimeout should be set to a value smaller than the time-to-live value.
Now the suggestions for your app.
I would just add a message to the queue for every task that you want to accomplish. The message will obviously have the pertinent information for processing. If you need to schedule several tasks, you can run a Scheduled Webjob (on a schedule of your choice) that adds messages to the queue. Then your continuous Webjob will pick up that message and process it.
Add a GUID to each message that goes to the queue. Store that GUID in some other domain of your application (a database). So when you dequeue the message for processing, the first thing you do is check against your database if the message needs to be processed. If you need to cancel the execution of a message, instead of deleting it from the queue, just update the GUID in your database.
There's more info here.
Hope this helps,
As for the first part of the question, you can use the Update Message operation to extend the visibility timeout of a message.
The Update Message operation can be used to continually extend the
invisibility of a queue message. This functionality can be useful if
you want a worker role to “lease” a queue message. For example, if a
worker role calls Get Messages and recognizes that it needs more time
to process a message, it can continually extend the message’s
invisibility until it is processed. If the worker role were to fail
during processing, eventually the message would become visible again
and another worker role could process it.
You can check the REST API documentation here: https://msdn.microsoft.com/en-us/library/azure/hh452234.aspx
For the second part of your question, there are really multiple ways and your method of storing the id/popReceipt as a lookup is a possible option, you can actually have a Web Job dedicated to receive messages on a different queue (e.g plz-delete-msg) and you send a message containing the "messageId" and this Web Job can use Get Message operation then Delete it. (you can make the job generic by passing the queue name!)
https://msdn.microsoft.com/en-us/library/azure/dd179474.aspx
https://msdn.microsoft.com/en-us/library/azure/dd179347.aspx

Is it acceptable to use ThreadPool.GetAvailableThreads to throttle the amount of work a service performs?

I have a service which polls a queue very quickly to check for more 'work' which needs to be done. There is always more more work in the queue than a single worker can handle. I want to make sure a single worker doesn't grab too much work when the service is already at max capacity.
Let say my worker grabs 10 messages from the queue every N(ms) and uses the Parallel Library to process each message in parallel on different threads. The work itself is very IO heavy. Many SQL Server queries and even Azure Table storage (http requests) are made for a single unit of work.
Is using the TheadPool.GetAvailableThreads() the proper way to throttle how much work the service is allowed to grab?
I see that I have access to available WorkerThreads and CompletionPortThreads. For an IO heavy process, is it more appropriate to look at how many CompletionPortThreads are available? I believe 1000 is the number made available per process regardless of cpu count.
Update - Might be important to know that the queue I'm working with is an Azure Queue. So, each request to check for messages is made as an async http request which returns with the next 10 messages. (and costs money)
I don't think using IO completion ports is a good way to work out how much to grab.
I assume that the ideal situation is where you run out of work just as the next set arrives, so you've never got more backlog than you can reasonably handle.
Why not keep track of how long it takes to process a job and how long it takes to fetch jobs, and adjust the amount of work fetched each time based on that, with suitable minimum/maximum values to stop things going crazy if you have a few really cheap or really expensive jobs?
You'll also want to work out a reasonable optimum degree of parallelization - it's not clear to me whether it's really IO-heavy, or whether it's just "asynchronous request heavy", i.e. you spend a lot of time just waiting for the responses to complicated queries which in themselves are cheap for the resources of your service.
I've been working virtually the same problem in the same environment. I ended up giving each WorkerRole an internal work queue, implemented as a BlockingCollection<>. There's a single thread that monitors that queue - when the number of items gets low it requests more items from the Azure queue. It always requests the maximum number of items, 32, to cut down costs. It also has automatic backoff in the event that the queue is empty.
Then I have a set of worker threads that I started myself. They sit in a loop, pulling items off the internal work queue. The number of worker threads is my main way to optimize the load, so I've got that set up as an option in the .cscfg file. I'm currently running 35 threads/worker, but that number will depend on your situation.
I tried using TPL to manage the work, but I found it more difficult to manage the load. Sometimes TPL would under-parallelize and the machine would be bored, other times it would over-parallelize and the Azure queue message visibility would expire while the item was still being worked.
This may not be the optimal solution, but it seems to be working OK for me.
I decided to keep an internal counter of how many message are currently being processed. I used Interlocked.Increment/Decrement to manage the counter in a thread-safe manner.
I would have used the Semaphore class since each message is tied to its own Thread but wasn't able to due to the async nature of the queue poller and the code which spawned the threads.

Controlling azure worker roles concurrency in multiple instance

I have a simple work role in azure that does some data processing on an SQL azure database.
The worker basically adds data from a 3rd party datasource to my database every 2 minutes. When I have two instances of the role, this obviously doubles up unnecessarily. I would like to have 2 instances for redundancy and the 99.95 uptime, but do not want them both processing at the same time as they will just duplicate the same job. Is there a standard pattern for this that I am missing?
I know I could set flags in the database, but am hoping there is another easier or better way to manage this.
Thanks
As Mark suggested, you can use an Azure queue to post a message. You can have the worker role instance post a followup message to the queue as the last thing it does when processing the current message. That should deal with the issue Mark brought up regarding the need for a semaphore. In your queue message, you can embed a timestamp marking when the message can be processed. When creating a new message, just add two minutes to current time.
And... in case it's not obvious: in the event the worker role instance crashes before completing processing and fails to repost a new queue message, that's fine. In this case, the current queue message will simply reappear on the queue and another instance is then free to process it.
There is not a super easy way to do this, I dont think.
You can use a semaphore as Mark has mentioned, to basically record the start and the stop of processing. Then you can have any amount of instances running, each inspecting the semaphore record and only acting out if semaphore allows it.
However, the caveat here is that what happens if one of the instances crashes in the middle of processing and never releases the semaphore? You can implement a "timeout" value after which other instances will attempt to kick-start processing if there hasnt been an unlock for X amount of time.
Alternatively, you can use a third party monitoring service like AzureWatch to watch for unresponsive instances in Azure and start a new instance if the amount of "Ready" instances is under 1. This will save you can save some money by not having to have 2 instances up and running all the time, but there is a slight lag between when an instance fails and when a new one is started.
A Semaphor as suggested would be the way to go, although I'd probably go with a simple timestamp heartbeat in blob store.
The other thought is, how necessary is it? If your loads can sustain being down for a few minutes, maybe just let the role recycle?
Small catch on David's solution. Re-posting the message to the queue would happen as the last thing on the current execution so that if the machine crashes along the way the current message would expire and re-surface on the queue. That assumes that the message was originally peeked and requires a de-queue operation to remove from the queue. The de-queue must happen before inserting the new message to the queue. If the role crashes in between these 2 operations, then there will be no tokens left in the system and will come to a halt.
The ESB dup check sounds like a feasible approach, but it does not sound like it would be deterministic either since the bus can only check for identical messages currently existing in a queue. But if one of the messages comes in right after the previous one was de-queued, there is a chance to end up with 2 processes running in parallel.
An alternative solution, if you can afford it, would be to never de-queue and just lease the message via Peek operations. You would have to ensure that the invisibility timeout never goes beyond the processing time in your worker role. As far as creating the token in the first place, the same worker role startup strategy described before combined with ASB dup check should work (since messages would never move from the queue).

Resources