Worker processes called in order azure - azure

If multiple worker processes have to be called in order, each one starting only after the previous worker has finished its task, how should this be done? (There is a queue containing pointers to blobs, and every worker has multiple instances. Please see my previous questions.)
Will the Azure fabric do this automatically, or is there a way to set this in the config file?

You just follow the same process that you've already got, but with more layers. If worker 1 reads something from queue 1 and needs to let worker 2 know that it's time to start processing the same file, worker 1 simply puts a message in queue 2.
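The hand-off described above can be sketched with two in-memory queues. This is only an illustration: queue_1, queue_2 and the two step functions are hypothetical stand-ins, not Azure SDK calls.

```python
import queue

# Hypothetical in-memory stand-ins for the two Azure queues.
queue_1 = queue.Queue()
queue_2 = queue.Queue()


def worker_1_step():
    """Take one blob pointer from queue 1, process it, then notify worker 2."""
    blob_name = queue_1.get()
    # ... worker 1's real processing of the blob would happen here ...
    queue_2.put(blob_name)  # hand the same blob on to the next stage


def worker_2_step():
    """Take one blob pointer from queue 2 and return it once handled."""
    blob_name = queue_2.get()
    # ... worker 2's real processing would happen here ...
    return blob_name


queue_1.put("blob-a")
worker_1_step()
result = worker_2_step()
```

Each extra stage is just one more queue and one more put at the end of the previous stage.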
Edit: OK, let me see if I fully understand what you're after here. It sounds like what you have is a batch of files that need to go through several processes, but none of them can go on to the next step until they've all finished the previous step.
If that is the case then, no, there is nothing in Azure that will do that for you automatically.
Because of this, if possible I'd rework my workers so that each file could just be sent on without worrying about what state the other files were in.
If that is not possible, then you need some way of monitoring which files have been completed and which ones are still pending. One way to do this (and hopefully you can expand on this) is for the code that creates the batch to create a progress row in a table somewhere (SQL Azure or Azure Tables, it doesn't really matter) for each file, send a message to worker 1, and start a background task to monitor this table.
When worker 1 finishes processing a file, it updates the relevant row in the monitoring table to say, "Worker 1 finished".
The background task that was created above waits until all of the rows have "Worker 1 finished" set to true, then creates the messages for worker 2 and starts watching the "Worker 2 finished" flag. Rinse and repeat for as many worker steps as you have.
When all steps are finished, you'll probably want the background task to clean up this table and also have some sort of timeout in case a message gets lost somewhere.
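A minimal sketch of that monitoring table, assuming a plain dictionary in place of SQL Azure or Azure Tables. The function names and the step names are made up for illustration; the timeout and clean-up mentioned above are left out.

```python
# Hypothetical progress table: one row per file, one flag per worker step.
def create_batch(file_names, steps):
    return {name: {step: False for step in steps} for name in file_names}


def mark_finished(table, file_name, step):
    """Called by a worker when it finishes one file."""
    table[file_name][step] = True


def step_complete(table, step):
    """True once every file in the batch has finished the given step."""
    return all(row[step] for row in table.values())


def current_step(table, steps):
    """First step not yet finished by all files, or None when the batch is done."""
    for step in steps:
        if not step_complete(table, step):
            return step
    return None


steps = ["worker1", "worker2"]
table = create_batch(["a.txt", "b.txt"], steps)
mark_finished(table, "a.txt", "worker1")
pending = current_step(table, steps)
mark_finished(table, "b.txt", "worker1")
after = current_step(table, steps)
```

The background task would poll current_step and send the messages for the next worker each time the value changes.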

Although what @knightpfhor is suggesting would do the trick, I would try to go about this in a simpler way, without referencing the names of workers :-)
Specifically, if you already know how many docs need to be processed, I would first create N rows in a table, each holding some info relevant to the current batch and each having columnKey set to the batch id. I'd then put N messages in my queue and let the worker processes pick them up. When each worker is done, it would delete the corresponding row in the table as well. A monitoring process would simply know that a batch has started and do a count every once in a while (if timing is not critical; otherwise the worker would do a count after it finishes removing the row), and once the count reaches zero it would spawn a new message in the relevant queue for the next worker role to process.
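Here is a rough sketch of the delete-and-count variant, where the worker itself does the count. The set and list are hypothetical in-memory stand-ins for the table and the next queue; a real table would need the delete and the count to behave sensibly under concurrent workers.

```python
# Hypothetical table keyed by (batch_id, doc_id); a row exists while a doc is pending.
pending_rows = set()
next_stage_queue = []  # stand-in for the next worker role's queue


def start_batch(batch_id, doc_ids):
    for doc_id in doc_ids:
        pending_rows.add((batch_id, doc_id))


def finish_doc(batch_id, doc_id):
    """Worker deletes its row, then counts what is left for the batch."""
    key = (batch_id, doc_id)
    if key not in pending_rows:
        return None  # row already removed (e.g. a redelivered message): do nothing
    pending_rows.remove(key)
    remaining = sum(1 for b, _ in pending_rows if b == batch_id)
    if remaining == 0:
        next_stage_queue.append(batch_id)  # batch done: wake the next stage
    return remaining


start_batch("b1", ["d1", "d2"])
left_after_first = finish_doc("b1", "d1")
left_after_second = finish_doc("b1", "d2")
repeat = finish_doc("b1", "d2")  # simulated duplicate delivery
```

The early return is one way to avoid sending the next-stage message twice when a queue message is processed more than once.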
If you want even more control, you could have a row in your table storing the state of your process (processing files, post-processing, etc.). In this case, I'd store the state transitions in a queue and make sure you only make them once. But that's a whole new question altogether.
Hope it helps.

Related

Splitting a large task to Azure functions and collecting the results (implementing a barrier over a service-bus)

I have tasks that I need to perform in Azure. Each task is broken into several parts that need to run in parallel. The number of parts is not known in advance.
I would like to implement this using Azure functions and service bus. I was thinking about the following architecture:
I receive the task on a service bus. Func 1 determines how many sub-parts should be created, Func 2 does the work, and Func 3 collects the results and passes them on using a service bus.
I could not find an efficient mechanism for collecting the (variable) number of sub-results, knowing when everything has completed, and, once all sub-parts are ready, passing the combined results to the output service bus.
Is there such a mechanism in Azure for collecting results from parallel sub-parts and only after everything is ready sending the data to the next stage? (this is like the barrier synchronization mechanism).
My approach:
Function 1 gets the work from its message queue. It creates a jobId (GUID) that will be used to correlate all of the pieces. It breaks up the work into sub-parts and records the jobId and sub-parts in a database.
Each sub-part becomes a message that is added into the message queue that Func 2 is listening on.
Once Func 2 receives and processes the message, it places a message in the queue for Func 3.
Func 3 records in the db that this sub-part of the job has completed and checks whether all of the sub-parts are now complete. If they are not, it does nothing else; if they are, it knows that everything has completed and can proceed.
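The jobId bookkeeping in this approach could look roughly like the sketch below. The jobs dictionary stands in for the database, and func1_split and func3_collect are hypothetical names; in a real database the record-and-check in Func 3 would have to be atomic, because several instances may finish at the same moment.

```python
import uuid

# Hypothetical stand-in for the database: job_id -> set of unfinished sub-part ids.
jobs = {}


def func1_split(work_items):
    """Create a job id, record its sub-parts, and return the messages for Func 2."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = set(range(len(work_items)))
    return [{"job_id": job_id, "part": i, "payload": item}
            for i, item in enumerate(work_items)]


def func3_collect(message):
    """Record one finished sub-part; return True when no sub-parts remain."""
    remaining = jobs[message["job_id"]]
    remaining.discard(message["part"])
    return not remaining


messages = func1_split(["a", "b", "c"])
done_flags = [func3_collect(m) for m in messages]
```

Only the call that empties the set reports completion, which is the barrier the question asks about.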
Check Logic Apps to govern the whole process. You can trigger from Service Bus, invoke individual functions through a loop etc. and incorporate results back to the document.

Azure Storage Queue - long time to process

I need to generate quite a number of reports, and a report can take about 5 minutes to generate (large amounts of data, many different sources).
The client will post messages to an Azure Storage Queue. There is a worker role that processes the messages and generates the reports.
If I want to scale this up let's say I end up with 10 worker roles that will process the messages from the queue and generate the reports. Then I will add messages into the queue like this:
message 1: process reports from 1 - 5
message 2: process reports from 6 - 11
........
message 10: process reports from 50 - 55 (might not be accurate the range)
If worker role 1 takes the first message and puts a lock on it, but processing takes 5 minutes, the lock will expire and the message will become visible in the queue again, so worker role 2 will take it and start processing it, and so forth.
How can I make sure that a queue message is consumed only once, keeping in mind that the task is a long one?
First of all: Using Azure Storage queues, you should be prepared for all of your operations to be idempotent: In case your queue item is processed multiple times, the same result should happen each time. The reason I bring this up: There's simply no way to guarantee you'll process the message one time (unless you check the DequeueCount property of the message and halt processing accordingly), due to unexpected events such as your role instance crashing/rebooting or your queue item processing code doing something unexpected like throwing an exception.
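As a small illustration of what idempotent handling can mean here, the sketch below skips reports that an earlier delivery already finished. The completed_reports set is a hypothetical stand-in for whatever durable marker you use (a table row, a blob); none of these names come from the Azure SDK.

```python
# Hypothetical store of finished report ids (in practice a table row or blob marker).
completed_reports = set()
generated = []  # records each real generation, to show it happens only once


def generate_report(report_id):
    generated.append(report_id)  # placeholder for the expensive work


def handle_message(report_ids):
    """Process a queue message; safe to call again if the message is redelivered."""
    for report_id in report_ids:
        if report_id in completed_reports:
            continue  # already done on an earlier delivery
        generate_report(report_id)
        completed_reports.add(report_id)


handle_message([1, 2, 3])
handle_message([1, 2, 3])  # simulated redelivery of the same message
```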
Next: Queue message invisibility timeout can be programmatically extended. This can be done via the queue API or via one of the language SDKs. In C# (something like this; I didn't test it), extending by an additional minute:
queueMessage.UpdateMessage(message,
    TimeSpan.FromSeconds(60),
    MessageUpdateFields.Visibility);
You can also modify the message along the way, as a hint to your code about which of the 5 reports have been completed. This should help your specific issue: in the event the message gets reprocessed, you don't have to process all five reports if the message has been modified to say something like "process reports from 3-5". Note: You can combine the MessageUpdateFields flags via |:
queueMessage.UpdateMessage(message,
    TimeSpan.FromSeconds(0),
    MessageUpdateFields.Content | MessageUpdateFields.Visibility);
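To show how the "process reports from 3-5" hint might be used, here is a sketch that rewrites the message text after each report. The message is a plain dictionary rather than a real queue message, and fail_at only exists to simulate a crash; the text format follows the example in the question.

```python
import re


def parse_range(content):
    """Extract (first, last) from text like 'process reports from 3 - 5'."""
    match = re.search(r"(\d+)\s*-\s*(\d+)", content)
    if match is None:
        raise ValueError(f"no report range in {content!r}")
    return int(match.group(1)), int(match.group(2))


def process_with_checkpoints(message, fail_at=None):
    """Generate each report, rewriting the message so a retry resumes mid-range.

    `message` is a dict standing in for a queue message; `fail_at` simulates a crash.
    """
    first, last = parse_range(message["content"])
    done = []
    for report in range(first, last + 1):
        if report == fail_at:
            return done  # crash before this report; message keeps last checkpoint
        done.append(report)  # placeholder for generating the report
        message["content"] = f"process reports from {report + 1} - {last}"
    return done


msg = {"content": "process reports from 1 - 5"}
first_try = process_with_checkpoints(msg, fail_at=3)
second_try = process_with_checkpoints(msg)
```

In a real worker, each rewrite of the content would be one update call against the queue, so the checkpoint survives a crash of the role instance.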
Lastly: If you're concerned with the length of time taken to process a batch of reports, perhaps rethink why you're processing five reports in each message, vs. one report per message. You can always read queue messages in batches. This is getting a bit subjective, as there's really no right or wrong way to do it, but it's just something for you to think about.

Azure queue and table storage transactions best practices

Using Azure table storage, I read about Entity Group Transactions within the same partition. Now what happens if I use an Azure queue together with table storage? Is it possible to handle a message from the queue, insert into table storage and, if something breaks, roll back and put the message on the queue again?
Or how should I handle such a scenario with Azure?
Tables and queues don't have any associated transactions.
Here's some general queue usage guidance:
Make sure queue actions are idempotent - that is, you'd have the same results executing a queue message more than once, with repeatable side-effects
Set a reasonable queue message visibility timeout. If your task looks like it will take longer, you can extend a message's invisibility timeout. This prevents other threads / role instances from grabbing the same queue item while you're still working on it.
For long-running tasks (or those where you want to avoid consuming a resource multiple times if possible), modify your queue message along the way, giving yourself status hints. For example: you have a render-video queue message: 'RENDER|Source-URL'. You're rendering video, and it needs two passes. You've done pass 1 with results stored in a temporary blob. You could modify the message with something like 'RENDER|Source-URL|Pass1-URL'. Now, assume something goes wrong and your rendering task fails for some reason. Later, when you pick up this message again, you can start from pass 2, instead of from the very beginning.
You don't need to worry about putting messages back on the queue. Messages aren't actually removed from the queue until you explicitly delete them. They simply become invisible during the invisibility timeout period you choose. If you don't delete by the end of that period (or extend the period), the message becomes visible again for someone else to read. Note: At this point, once someone else reads the queue message, the original message-holder will no longer be able to delete the message.
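The visibility behaviour described in the last point can be modelled with a toy queue. SimulatedQueue is purely illustrative and is not how the storage service is implemented; time is passed in explicitly so the example stays deterministic, and the receipt counter loosely mimics the fact that each read hands out a fresh receipt.

```python
import itertools


class SimulatedQueue:
    """Toy model of visibility timeouts; time is passed in explicitly."""

    def __init__(self):
        self._messages = {}  # message id -> [content, visible_at, receipt]
        self._ids = itertools.count(1)
        self._receipts = itertools.count(1)

    def put(self, content):
        self._messages[next(self._ids)] = [content, 0.0, None]

    def get(self, now, timeout):
        """Return (id, content, receipt) for the first visible message, or None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + timeout         # hide it, do not remove it
                entry[2] = next(self._receipts)  # a new receipt per read
                return msg_id, entry[0], entry[2]
        return None

    def delete(self, msg_id, receipt):
        """Delete only if the caller holds the latest receipt."""
        entry = self._messages.get(msg_id)
        if entry is None or entry[2] != receipt:
            return False
        del self._messages[msg_id]
        return True


q = SimulatedQueue()
q.put("RENDER|Source-URL")
first = q.get(now=0, timeout=30)
hidden = q.get(now=10, timeout=30)   # still invisible to other readers
second = q.get(now=31, timeout=30)   # timeout passed, visible again
stale_delete = q.delete(first[0], first[2])
fresh_delete = q.delete(second[0], second[2])
```

The first reader's delete fails because the message was handed out again, which is why the work itself has to tolerate being done twice.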

How does one determine if all messages in an Azure Queue have been processed?

I've just begun tinkering with Windows Azure and would appreciate help with a question.
How does one determine if a Windows Azure Queue is empty and that all work-items in it have been processed? If I have multiple worker processes querying a work-item queue, GetMessage(s) returns no messages if the queue is empty. But there is no guarantee that a currently invisible message will not be pushed back into the queue.
I need this functionality since follow-up behavior of my workflow depends on completion of all work-items in that particular queue. A possible way of tackling this problem would be to count the number of puts and deletes. But this will again require synchronization at a shared storage level and I would like to avoid it if possible.
Any ideas?
Take a look at the ApproximateMessageCount method. This should return the number of messages on the queue, including invisible messages (e.g. the ones being processed).
Mike Wood blogged about this subtlety, along with a tidbit about the queue's Clear method, here.
That said: you might want to choose a different mechanism for workflow management. Maybe a table row, where you have your rowkey equal to some multi-queue-item transaction id, and individual properties being status flags. This allows you to track failed parts of the transaction (say, 9 out of 10 queue items process ok, the 10th fails; you can still delete the 10th queue item, but set its status flag to failed, which lets you deal with this scenario accordingly). Also: let's say you use the same queue to process another 'transaction' (meaning the queue is again non-zero in length). By using a separate object like a Table Row, you can still determine that your 'transaction' is complete even though there are additional queue messages.
The best way is to have another queue, call it termination indicator queue, and put a message in that queue for every message you process from your main queue. That is how it is done in research projects too. Check this out http://www.cs.gsu.edu/dimos/content/gis-vector-data-overlay-processing-azure-platform.html
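A minimal sketch of the termination indicator queue idea, using in-memory queues as hypothetical stand-ins. It assumes the coordinator knows how many items it enqueued; with at-least-once delivery a real version would also have to cope with the same item being signalled twice.

```python
import queue

work_queue = queue.Queue()
termination_queue = queue.Queue()  # one entry per finished work item


def worker_drain():
    """Process every available work item, signalling each completion."""
    while True:
        try:
            item = work_queue.get_nowait()
        except queue.Empty:
            return
        # ... real processing of `item` would happen here ...
        termination_queue.put(item)


def all_done(expected_count):
    """The coordinator compares completions against the number it enqueued."""
    return termination_queue.qsize() >= expected_count


items = ["a", "b", "c"]
for item in items:
    work_queue.put(item)
before = all_done(len(items))
worker_drain()
after = all_done(len(items))
```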

Multithreading Task Library, Threading.Timer or threads?

Hi, we are building an application that will make it possible to register scheduled tasks.
Each task has a time interval at which it should be executed.
Each task should have a timeout.
The number of tasks is unbounded, but around 100 in normal cases.
So we have a list of tasks that need to be executed at intervals. What is the best solution?
I have looked at giving each task its own timer; when the timer elapses the work is started, and another timer keeps track of the timeout, so if the timeout is reached that timer stops the thread.
This feels like we are overusing timers? Or could it work?
Another solution is to use a timer for each task, but when the time elapses we put the task on a queue that is read by some threads that execute the work?
Any other good solutions I should look for?
There is not much information here, but it looks like you could consider Rx as well; check more at MSDN.com.
You can think about your tasks as generated events which should be composed (scheduled) in some way. So you can do the following:
Spawn cancellable tasks with Observable.GenerateWithDisposable and your own Scheduler - check more at Rx 101 Sample
Delay tasks with Observable.Delay
Wait for tasks with Observable.Timeout
Compose tasks in any preferable way
Once again, you can find more at the links mentioned above.
You should check out Quartz.NET.
Quartz.NET is a full-featured, open source job scheduling system that can be used from smallest apps to large scale enterprise systems.
I believe you would need to implement your timeout requirement by yourself but all the plumbing needed to schedule tasks could be handled by Quartz.NET.
I have done something like this before where there were a lot of socket objects that needed periodic starts and timeouts. I used a 'TimedAction' class with 'OnStart' and 'OnTimeout' events, (socket classes etc. derived from this), and one thread that handled all the timed actions. The thread maintained a list of TimedAction instances ordered by the tick time of the next action required, (delta queue). The TimedAction objects were added to the list by queueing them to the thread input queue. The thread waited on this input queue with a timeout, (this was Windows, so 'WaitForSingleObject' on the handle of the semaphore that managed the queue), set to the 'next action required' tick count of the first item in the list. If the queue wait timed out, the relevant action event of the first item in the list was called and the item removed from the list - the next queue wait would then be set by the new 'first item in the list', which would contain the new 'nearest action time'. If a new TimedAction arrived on the queue, the thread calculated its timeout tick time, (GetTickCount + ms interval from the object), and inserted it in the sorted list at the correct place, (yes, this sometimes meant moving a lot of objects up the list to make space).
The events called by the timeout handler thread could not take any lengthy actions in order to prevent delays to the handling of other timeouts. Typically, the event handlers would set some status enumeration, signal some synchro object or queue the TimedAction to some other P-C queue or IO completion port.
Does that make sense? It worked OK, processing thousands of timed actions in my server in a reasonably timely and efficient manner.
One enhancement I planned to make was to use multiple lists with a restricted set of timeout intervals. There were only three const timeout intervals used in my system, so I could get away with using three lists, one for each interval. This would mean that the lists would not need sorting explicitly - new TimedActions would always go to the end of their list. This would eliminate costly insertion of objects in the middle of the list/s. I never got around to doing this as my first design worked well enough and I had plenty of other bugs to fix :(
Two things:
Beware 32-bit tickCount rollover.
You need a loop in the queue timeout block - there may be items on the list with exactly the same, or near-same, timeout tick count. Once the queue timeout happens, you need to remove from the list and fire the events of every object until the newly calculated timeout time is >0. I fell foul of this one. Two objects with equal timeout tick count arrived at the head of the list. One got its events fired, but the system tick count had moved on and so the calculated timeout tick for the next object was -1: INFINITE! My server stopped working properly and eventually locked up :(
Rgds,
Martin
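The single-thread timed-action list from the answer above can be sketched as follows. This is not the original code: it swaps the sorted list for a heap, takes the current time as a plain float argument instead of a 32-bit tick count (so rollover does not arise here), and TimedActionList and its methods are made-up names. It does keep the two points the answer stresses: fire everything that is due in a loop, and never let the computed wait go negative.

```python
import heapq


class TimedActionList:
    """Keeps actions ordered by due time and fires everything that is due."""

    def __init__(self):
        self._heap = []     # entries are (due_time, sequence, callback)
        self._sequence = 0  # tie-breaker so equal due times never compare callbacks

    def add(self, now, interval, callback):
        heapq.heappush(self._heap, (now + interval, self._sequence, callback))
        self._sequence += 1

    def next_wait(self, now):
        """Seconds to wait before the next action, clamped at 0; None if empty."""
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - now)

    def fire_due(self, now):
        """Loop until the head is in the future, so equal due times all fire."""
        fired = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, callback = heapq.heappop(self._heap)
            callback()
            fired += 1
        return fired


log = []
actions = TimedActionList()
actions.add(now=0.0, interval=5.0, callback=lambda: log.append("a"))
actions.add(now=0.0, interval=5.0, callback=lambda: log.append("b"))
actions.add(now=0.0, interval=9.0, callback=lambda: log.append("c"))
fired = actions.fire_due(now=5.0)
wait = actions.next_wait(now=5.0)
```

As in the answer, the callbacks should stay short (set a flag, queue the work elsewhere) so that one slow action does not delay the others.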

Resources