How to avoid starvation with Azure Service Bus Sessions

I have an Azure Function as a ServiceBusTrigger listening on a queue that has session enabled.
The producer may send messages with a very wide variety of session IDs (let's say 1,000 different values for the session ID).
By default, the Azure Function host will allow 8 concurrent sessions to be processed.
This means that out of the 1,000 session IDs, only 8 will be processed at any given time. So when the Azure Function host starts, the first 8 session IDs will have their messages processed. If one of these sessions is idle for one minute (i.e. it receives no message for more than one minute), its lock is released and a new session (the ninth one) gets its messages processed.
Consequently, if the first 8 session IDs each receive at least one message per minute, their locks are never released and their consumers are never free to process another session ID, leaving the remaining 992 session IDs to starve (i.e. they are never given a chance to have their messages processed).
Obviously, I could update my host.json so that maxConcurrentSessions is set to 1,000. But I don't like this solution because it hardcodes my configuration to my system's current requirements, and those requirements may vary over time; i.e. I would have to find a way to monitor that session IDs are not starving, because six months from now I might need to increase maxConcurrentSessions to 2,000.
What I am looking for is a mechanism that auto-adjusts. It seems to me that the Azure Service Bus extension is missing a setting representing a maximum time-to-live for the lock. For instance, I should be allowed to specify something like:
{
    "extensions": {
        "serviceBus": {
            "sessionIdleTimeout": "00:00:15",
            "sessionTimeToLive": "00:00:30"
        }
    }
}
With a configuration like this, I would basically be saying that if a session ID receives no messages for 15 seconds, its lock should be released so that another session ID can be given a chance to be processed. Additionally, the TTL would kick in if that same session ID is constantly receiving a new message every second: its lock would be forcibly released after 30 seconds even though the session still has messages to process, and another session ID would be given a chance at processing.
Now given that there is nothing functionally equivalent to sessionTimeToLive in Azure Service Bus to my knowledge, would anyone have an idea on how I am supposed to handle this?

The entity lock duration combined with the "maxAutoLockRenewalDuration" setting already behaves like the proposed "sessionTimeToLive". By default "maxAutoLockRenewalDuration" is set to 5 minutes, but you can set it to a lower value (or 0 if you don't want the lock to be renewed at all).
Essentially, the max processing time for a session would be Max(LockDuration, MaxAutoLockRenewalDuration).
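For example, assuming the v5 Service Bus extension (whose host.json schema exposes both settings), the configuration proposed in the question could be approximated like this, with "maxAutoLockRenewalDuration" playing the role of the proposed "sessionTimeToLive":
{
    "extensions": {
        "serviceBus": {
            "sessionIdleTimeout": "00:00:15",
            "maxAutoLockRenewalDuration": "00:00:30"
        }
    }
}
Once renewal stops, the lock simply expires at the end of the current lock duration and the session becomes available to another receiver.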

Related

Will a Queue Storage message that is placed back in the queue always be placed in the front of the queue?

The docs say for Azure Storage queues that:
Messages in Storage queues are typically first-in-first-out, but sometimes they can be out of order; for example, when a message's visibility timeout duration expires (for example, as a result of a client application crashing during processing). When the visibility timeout expires, the message becomes visible again on the queue for another worker to dequeue it. At that point, the newly visible message might be placed in the queue (to be dequeued again) after a message that was originally enqueued after it.
I only allow my function app to scale to a maximum of 1 instance, so it sounds to me like, if the function crashes, the message is placed back in the queue (at the front), and when the app restarts it retries the same message rather than the next one in the queue. In this way I would be able to guarantee ordering. Does this sound right?
I know I can guarantee ordering with Service Bus using sessions, but I'm trying to avoid that: I have to run this solution with VNETs, and then I'd have to use the premium tier, which is pricey.
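For what it's worth, the behavior quoted from the docs maps directly onto the SDK: a received message stays invisible for its visibility timeout and reappears if it is never deleted. A minimal sketch using Azure.Storage.Queues, assuming a queue named "orders" and a stand-in Process helper:
using System;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

string connectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION");
var queue = new QueueClient(connectionString, "orders");

void Process(BinaryData body) => Console.WriteLine(body.ToString()); // stand-in for real work

// Pull one message at a time; it stays invisible for 5 minutes while we work.
QueueMessage[] batch = await queue.ReceiveMessagesAsync(
    maxMessages: 1, visibilityTimeout: TimeSpan.FromMinutes(5));

foreach (QueueMessage msg in batch)
{
    Process(msg.Body); // if this throws (or the app crashes), the message is
                       // never deleted and becomes visible again after 5 minutes
    await queue.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
}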

In an Azure ServiceBus session enabled subscription why do I receive messages with the same session id on multiple subscriber instances

I have a session enabled subscription on a Service Bus topic and have four subscription clients running against this subscription.
I publish 10000 messages across 100000 random sessions to the topic, and looking at my output I can see multiple subscription clients processing messages for the same session ID within seconds of each other, i.e.
SessionId    Client A                  Client B
1234         processing at 10:30:04
1234                                   processing at 10:30:29
1234         processing at 10:31:00
This was done with the session idle timeout default of 60 seconds.
I then set the session idle timeout to a 2-second timespan. Each SessionProcessor has MaxConcurrentSessions set to 100 and AutoCompleteMessages set to false.
I am also observing a lot of errors for "The session lock was lost".
When I get a session lock exception, the session is then started on a different client, and that second client reprocesses a message that had already been processed on the client that lost the session lock.
My question is: do I need to record processed message IDs before calling CompleteMessageAsync on the ProcessSessionMessageEventArgs, and then check that record for every single message consumed to avoid processing it again? Or is there a reason I'm unaware of for a session to lose its lock? Does this happen when the session idle timeout elapses before CompleteMessageAsync is called?
This is all test code, so my processor simply writes to the console and does no long-running work.
When using a topic, messages are copied and forwarded to every subscription, and the session is then visible to each subscription's receivers.
If you don't want all receivers to process each session, you should just use a single queue with all receivers listening on it. Each session and its set of messages will then be processed only once, by one of the receivers.
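To illustrate that suggestion, here is a minimal sketch with Azure.Messaging.ServiceBus in which every receiver runs the same session processor against a single session-enabled queue (the queue name "work" and the connection string are assumptions):
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

string connectionString = Environment.GetEnvironmentVariable("SERVICEBUS_CONNECTION");
await using var client = new ServiceBusClient(connectionString);

var processor = client.CreateSessionProcessor("work", new ServiceBusSessionProcessorOptions
{
    MaxConcurrentSessions = 100,                  // as in the question
    SessionIdleTimeout = TimeSpan.FromSeconds(2),
    AutoCompleteMessages = false,
});

processor.ProcessMessageAsync += async args =>
{
    Console.WriteLine($"{args.SessionId}: {args.Message.MessageId}");
    // Complete promptly; completing after the session lock has lapsed is one
    // way to end up with "The session lock was lost" and a redelivery.
    await args.CompleteMessageAsync(args.Message);
};
processor.ProcessErrorAsync += args =>
{
    Console.WriteLine(args.Exception.Message);
    return Task.CompletedTask;
};

await processor.StartProcessingAsync();
Console.ReadLine(); // keep the process alive while the processor runs
await processor.StopProcessingAsync();
Because redelivery can still happen after a lost lock, handlers should be idempotent (e.g. keyed on MessageId) regardless of the topology chosen.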

How to make multiple API calls with rate limits per user using RabbitMQ?

In my app I am getting data on behalf of different users via one API which has a rate limit of 1 API call every 2 seconds per user.
Currently I am storing all the calls I need to make in a single message queue. I am using RabbitMQ for this.
There is currently one consumer that takes one message at a time, makes the call, processes the result, and then starts on the next message.
The queue is filling up faster than this single consumer can make the API calls (1 call every 2 seconds, since I don't know which user comes next and I don't want to hit the API limits).
My problem is that I don't know how to add more consumers. In theory it should be possible, since the queue holds jobs for different users and the API rate limit is per user; e.g. I could make 2 API calls every 2 seconds if they are for different users.
However, I have no information about the messages in the queue: they could be from a single user or from many different users.
The only solution I see right now is to create a separate queue for each user. But I have many different users (say 1,000) and would rather stay with 1 queue.
If possible I would stick with RabbitMQ as I use this for other similar tasks as well. But if I need to change my stack I would be willing to do so.
App is using the MEAN stack.
You will need to maintain state somewhere. I had a similar application, and what I did was keep the state in Redis: before every call, check whether the user has made a request in the last 2 seconds, e.g.:
Redis key:
user:<user_id> // value is an epoch timestamp
Update Redis once the request is made.
Reference:
redis
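A minimal sketch of that gate in C# with StackExchange.Redis (the key format follows the answer; SET NX with a 2-second expiry makes the check-and-claim atomic, which matters once multiple consumers share the queue):
using System;
using StackExchange.Redis;

var redis = await ConnectionMultiplexer.ConnectAsync("localhost"); // assumed address
IDatabase db = redis.GetDatabase();

// Returns true if this user has made no call in the last 2 seconds and
// atomically claims the slot; returns false otherwise.
bool TryAcquireSlot(string userId) =>
    db.StringSet($"user:{userId}",
                 DateTimeOffset.UtcNow.ToUnixTimeSeconds(),
                 expiry: TimeSpan.FromSeconds(2),
                 when: When.NotExists);

// In each consumer: if TryAcquireSlot(...) is false, requeue (nack) the message
// and take the next one, instead of blocking the whole queue for 2 seconds.
Console.WriteLine(TryAcquireSlot("42")); // true on first call, false within 2s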

What assumptions can I make about global time on Azure?

I want my Azure role to reprocess data in case of sudden failures. I consider the following option.
For every block of data to process I have a database table row, to which I could add a column meaning "time of last ping from a processing node". When a node grabs a data block for processing, it sets the state to "processing" and that time to the current time, and it is then the node's responsibility to update the time, say, every minute. Periodically, some node would ask for all blocks in the "processing" state whose ping time is more than ten minutes old, consider those blocks abandoned, and queue them for reprocessing.
I have one very serious concern: the above approach requires that nodes have more or less the same time. Can I rely on all Azure nodes having the same time with some reasonable precision (say, a few seconds)?
For processing times under 2 hrs, you can usually rely on queue semantics (visibility timeout). If you have the data stored in blob storage, you can have a worker pop a queue message containing the name of the blob to work on and set a reasonable visibility timeout on the message (up to 2 hrs today). Once it completes the work, it can delete the queue message. If it fails, the delete is never called and after the visibility timeout, it will reappear on the queue for reprocessing. This is why you want your work to be idempotent, btw.
For processing that lasts longer than two hours, I generally recommend a leasing strategy where the worker leases the underlying blob data (if possible, or a dummy blob otherwise) using the intrinsic lease functionality in Windows Azure blob storage. When a worker goes to retrieve a file, it tries to lease it. A file that is already leased indicates a worker role currently processing it. If a failure occurs, the lease is broken and the file becomes leasable by another instance. Leases must be renewed every minute or so, but they can be held indefinitely.
Of course, you are keeping the data to be processed in blob storage, right? :)
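A sketch of that leasing strategy with the current SDK, Azure.Storage.Blobs (the original answer predates it; the container and blob names are assumptions):
using System;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

string connectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION");
var blob = new BlobClient(connectionString, "work", "block-0001.dat");
BlobLeaseClient lease = blob.GetBlobLeaseClient();

try
{
    await lease.AcquireAsync(TimeSpan.FromSeconds(60)); // max finite lease duration
}
catch (RequestFailedException)
{
    return; // already leased: another worker is processing this block
}

// ... process the block, calling lease.RenewAsync() at least once a minute ...
// If this worker dies, renewal stops, the lease expires, and another
// instance's AcquireAsync will succeed.
await lease.ReleaseAsync();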
As already indicated, you should not rely on synchronized times between VM nodes. If you store datetimes for any reason - use UTC or you will be sorry later.
The answer here isn't time-based synchronization (if you go that route anyway, make sure you use UtcNow), and there is no guarantee anywhere that the clocks are synced. Nor should there be.
For the problem you are describing, a queue-based system is the answer. I've referenced it a lot and will do so again: I've explained some benefits of queue-based systems in my blog post.
The idea is the following:
You put a work item on the queue
Your worker role (one or many of them) peeks & locks the message
You try to process the message; if you succeed, you remove the message from the queue
If not, you let it stay where it is
With your approach I would use AppFabric Queues, because you can also have topics & subscriptions, which allows you to monitor the data items. The example in my blog post covers this exact scenario; the only difference is that instead of a worker role, I poll the queue from my web application. But the concept is the same.
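(AppFabric Queues later became Azure Service Bus queues.) A sketch of the peek & lock flow listed above using the current SDK, Azure.Messaging.ServiceBus, with an assumed queue name and stand-in processing:
using System;
using Azure.Messaging.ServiceBus;

string connectionString = Environment.GetEnvironmentVariable("SERVICEBUS_CONNECTION");
await using var client = new ServiceBusClient(connectionString);
ServiceBusReceiver receiver = client.CreateReceiver("work-items"); // PeekLock is the default mode

ServiceBusReceivedMessage msg = await receiver.ReceiveMessageAsync();
if (msg is null) return; // nothing to do right now
try
{
    Console.WriteLine(msg.Body.ToString());    // stand-in for real processing
    await receiver.CompleteMessageAsync(msg);  // success: message removed from the queue
}
catch
{
    await receiver.AbandonMessageAsync(msg);   // failure: lock released, message stays queued
}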
I would try this a different way, using queue storage. If you put your block of data on a queue with a timeout, you can then have your processing nodes (worker roles?) pull this data off the queue.
After the data is popped off the queue, if the processing node does not delete the entry, it will reappear on the queue for processing once the timeout period expires.
Remote desktop into a role instance and check (a) the time zone (UTC, I think), and (b) that Internet Time is enabled in Date and Time settings. If so then you can rely on them being no more than a few ms apart. (This is not to say that the suggestions to use a message queue instead won't work, but perhaps they do not suit your needs.)

Controlling azure worker roles concurrency in multiple instance

I have a simple worker role in Azure that does some data processing on a SQL Azure database.
The worker basically adds data from a 3rd-party data source to my database every 2 minutes. When I have two instances of the role, this work is obviously duplicated unnecessarily. I would like to have 2 instances for redundancy and the 99.95% uptime, but I do not want them both processing at the same time, as they will just duplicate the same job. Is there a standard pattern for this that I am missing?
I know I could set flags in the database, but am hoping there is another easier or better way to manage this.
Thanks
As Mark suggested, you can use an Azure queue to post a message. Have the worker role instance post a follow-up message to the queue as the last thing it does when processing the current message. That should deal with the issue Mark brought up regarding the need for a semaphore. In your queue message, you can embed a timestamp marking when the message can be processed; when creating a new message, just add two minutes to the current time.
And... in case it's not obvious: in the event the worker role instance crashes before completing processing and fails to repost a new queue message, that's fine. In this case, the current queue message will simply reappear on the queue and another instance is then free to process it.
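A sketch of that pattern with Azure.Storage.Queues; instead of embedding a timestamp in the body, the initial visibility delay on SendMessageAsync keeps the follow-up token hidden for two minutes (the queue name and FetchAndStore helper are assumptions; see the caveat about operation ordering further down):
using System;
using System.Linq;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

string connectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION");
var queue = new QueueClient(connectionString, "poll-3rd-party");

void FetchAndStore() => Console.WriteLine("pulling 3rd-party data"); // stand-in for the 2-minute job

QueueMessage msg = (await queue.ReceiveMessagesAsync(maxMessages: 1)).Value.FirstOrDefault();
if (msg != null)
{
    FetchAndStore();
    // Post the follow-up token first, hidden for 2 minutes...
    await queue.SendMessageAsync("go", visibilityTimeout: TimeSpan.FromMinutes(2));
    // ...then consume the current one. Crashing between the two calls leaves an
    // extra token (duplicate work) rather than zero tokens (a halted system).
    await queue.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
}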
There is not a super easy way to do this, I don't think.
You can use a semaphore, as Mark has mentioned, to record the start and the stop of processing. Then you can have any number of instances running, each inspecting the semaphore record and only acting if the semaphore allows it.
However, the caveat here is: what happens if one of the instances crashes in the middle of processing and never releases the semaphore? You can implement a "timeout" value after which other instances will attempt to kick-start processing if there hasn't been an unlock for X amount of time.
Alternatively, you can use a third-party monitoring service like AzureWatch to watch for unresponsive instances in Azure and start a new instance if the number of "Ready" instances drops below 1. This can save you some money by not having to keep 2 instances up and running all the time, but there is a slight lag between when an instance fails and when a new one is started.
A semaphore, as suggested, would be the way to go, although I'd probably go with a simple timestamp heartbeat in blob storage.
The other thought is, how necessary is it? If your loads can sustain being down for a few minutes, maybe just let the role recycle?
A small catch on David's solution: re-posting the message to the queue happens as the last thing in the current execution, so that if the machine crashes along the way, the current message expires and resurfaces on the queue. That assumes the message was originally peeked and that a de-queue operation is required to remove it from the queue. The de-queue must happen before inserting the new message onto the queue, and if the role crashes between these two operations, there will be no tokens left in the system and it will come to a halt.
The ASB (Azure Service Bus) duplicate-detection check sounds like a feasible approach, but it does not sound deterministic either, since the bus can only check for identical messages currently existing in a queue. If one of the messages comes in right after the previous one was de-queued, there is a chance of ending up with 2 processes running in parallel.
An alternative solution, if you can afford it, would be to never de-queue and just lease the message via Peek operations. You would have to ensure that the visibility timeout never goes beyond the processing time in your worker role. As for creating the token in the first place, the same worker-role startup strategy described before, combined with the ASB duplicate-detection check, should work (since messages would never move from the queue).
