Windows Azure Service Bus Queues: Throttling and TOPAZ

Today at a customer we analysed the logs of the previous weeks and we found the following issue regarding Windows Azure Service Bus Queues:
The request was terminated because the entity is being throttled.
Please wait 10 seconds and try again.
After verifying the code, I told them to use the Transient Fault Handling Application Block (TOPAZ) to implement a retry policy like this one:
var retryStrategy = new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2));
var retryPolicy = new RetryPolicy<ServiceBusTransientErrorDetectionStrategy>(retryStrategy);
The customer answered:
"Ah that's great, so it will also handle the fact that it should wait
for 10 seconds when throttled."
Come to think of it, I never verified whether this was actually the case; I always assumed it was. I looked through the Microsoft.Practices.EnterpriseLibrary.WindowsAzure.TransientFaultHandling assembly for code that would wait 10 seconds in case of throttling, but didn't find anything.
Does this mean that TOPAZ isn't sufficient to create resilient applications? Should it be combined with custom code to handle throttling (i.e., wait 10 seconds in case of a specific exception)?

As far as throttling is concerned, TOPAZ provides a set of built-in retry strategies, including:
- Fixed interval
- Incremental intervals
- Random exponential back-off intervals
You can also write your own custom retry strategy and plug it into TOPAZ.
Also, as Brent indicated, the 10-second wait is not mandatory. In many cases, retrying immediately may succeed without the need to wait. By default, TOPAZ performs the first retry immediately before using the retry intervals defined by the strategy.
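As a rough illustration of how those three strategies differ, here is a Python sketch that computes the delay sequence each one produces. The function names and the jitter formula are mine; they mirror the general shape of TOPAZ's strategies, not its exact implementation:

```python
import random

def fixed_intervals(retry_count, interval):
    """Fixed interval: the same delay before every retry."""
    return [interval] * retry_count

def incremental_intervals(retry_count, initial, increment):
    """Incremental: the delay grows by a fixed increment on each retry."""
    return [initial + i * increment for i in range(retry_count)]

def exponential_backoff_intervals(retry_count, min_backoff, max_backoff, delta):
    """Random exponential back-off: the delay roughly doubles each retry,
    with jitter, and is clamped to a maximum."""
    intervals = []
    for i in range(retry_count):
        jittered = random.uniform(delta * 0.8, delta * 1.2)
        intervals.append(min(min_backoff + (2 ** i - 1) * jittered, max_backoff))
    return intervals

# The Incremental(5, 1s, 2s) policy from the question yields these delays:
print(incremental_intervals(5, 1, 2))  # [1, 3, 5, 7, 9]
```

Note how the incremental strategy from the original question never waits the full 10 seconds; whether that matters depends on whether immediate or short retries succeed, which (as noted above) they often do.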
For more info, see Chapter 6 of the "Building Elastic and Resilient Cloud Apps" developer's guide, also available as epub/mobi/pdf from here.
If you have suggestions or feature requests for TOPAZ, please submit them via the UserVoice.

As I recall, the "10 second" wait isn't a requirement. Additionally, I believe TOPAZ also has backoff capabilities, which would help you overcome this.
On a personal note, I'd argue that simply utilizing something like TOPAZ is not sufficient to create a truly resilient solution. Resiliency goes beyond just handling throttling on a single connection point; you'll also need to be able to handle failover to a redundant endpoint, which TOPAZ won't do.

Related

Backoff Strategy after hitting rate limits

When you hit the rate limits on getstream, the APIs start responding with errors.
What is the recommended approach for a backoff strategy to handle those failures and recover afterwards? I thought about logging them all and sending them again after a minute or an hour.
But what if a user created a post (which failed to be created on getstream and is waiting for a backoff) and meanwhile the user deletes it? The backoff script will send the post to getstream even though the user deleted it.
What does getstream recommend, and has anyone handled a situation like this?
As you point out, API rate-limit errors are typically handled with (exponential) backoff solutions.
This often involves additional application logic (flow control and queues) and special purpose data services / storage (message queues, async workers etc). This can add quite some complexity to an application.
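A minimal sketch of that exponential backoff pattern in Python, assuming a hypothetical `RateLimitError` raised by the API client; the `sleep` parameter is injectable so the loop can be tested without actually waiting:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429-style error an API raises when you hit its rate limit."""

def call_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Retry `operation` with exponential backoff plus jitter on rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error (e.g. park it in a queue)
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Jitter spreads retries out so many clients don't retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

For the delete-while-queued problem from the question, whatever replays the parked operations should re-check that the post still exists locally before resending it, rather than blindly replaying the original request.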
When it comes to the Stream service, being rate-limited is usually an indication of either a flaw/deficiency in the implementation (much like a performance bug) or that the application has reached a scale beyond what the current plan is intended to support.
It'd be wise to contact Stream support directly about this.

Can I use a retry policy in an azure function?

I'm using Event Hubs to temporarily store data which will first be saved to Azure Table Storage and then indexed to Elasticsearch.
I was thinking that I should do the storage-saving calls in an Azure Function, and do the same for the Elasticsearch indexing using NEST.
It is important that the data is processed, so I was thinking I'd use Polly as a retry policy in case the Elasticsearch server is failing. However, won't a retry policy potentially make the Azure Function expensive?
Are Azure Functions even the right way to go?
Yes, you can use Polly for retries inside your Azure Functions. Some further considerations:
Yes, you will pay for the retry time. But given that your Elasticsearch is "mostly up", the extra price for occasional retries should not be too high.
If you want to retry saving to Table Storage too, you will have to write the calls decorated with Polly yourself, instead of using the otherwise-preferred output binding.
Make sure to check whether the order of writes is important to you, and whether you should retry Table Storage writes to completion before you start writing to Elasticsearch, or vice versa. Otherwise you can do them in parallel with async calls and then Task.WaitAll.
The maximum execution time of a Function is 5 minutes by default, and you can configure it up to a maximum of 10 minutes. If you need to handle outages longer than that, you probably need a plan B, e.g. start copying the events that keep failing for longer than 4 (or 9) minutes to a dedicated queue and retry from there, or disable the Function for such periods of downtime.
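That "plan B" can be sketched as a retry loop with a time budget that dead-letters the event to a queue when the budget runs out. This is illustrative Python, not Functions-specific code, and all names here are hypothetical; the clock and sleep are injectable for testing:

```python
import time

def process_with_deadline(event, handler, dead_letter, budget_seconds=240,
                          retry_delay=5, clock=time.monotonic, sleep=time.sleep):
    """Retry `handler(event)` until it succeeds or the time budget is spent,
    then hand the event to `dead_letter` so a separate process can retry later.
    Returns True if handled, False if dead-lettered."""
    deadline = clock() + budget_seconds
    while True:
        try:
            handler(event)
            return True
        except Exception:
            if clock() >= deadline:
                dead_letter(event)  # e.g. enqueue to a dedicated poison/retry queue
                return False
            sleep(retry_delay)
```

Keeping the budget comfortably under the Function timeout (4 minutes under a 5-minute limit) leaves room to dead-letter cleanly instead of being killed mid-retry.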
Yes it is. You could use a library, or better, just write a simple linear backoff strategy (e.g. try 5 times with a 5-second sleep in between) and do something like
context.log.error({
message: `Transient failure. This is Retry number ${retryCount}.`,
errorCode: errorCodeFromCallingElasticSearch,
errorDetails: moreContextMaybeSomeStack
});
every time you hit the retry logic, so it goes to App Insights (make sure you integrate with App Insights; otherwise you have no ops, or rather completely dark ops).
You can then query for how often it's really a miss and get an idea of how well things are going at the 95th percentile.
Occasionally running 10 seconds over the normal 1-second execution time for your function is going to cost extra, but probably nowhere near a full dedicated App Service Plan. If it comes close, just switch to that; it means your function is mostly on rather than off, which is still a perfectly good case for running a function.
App Insights can also trigger alerts if some metric goes haywire; if your retry count stays around 11 for 24 hours, you probably want to know about that deviation. You'll need to send the retry count as a custom metric to trigger an alert off of it:
context.log.metric("CallElasticSearchRetryCount", retryCount);
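Folding that logging call into the complete loop, the linear backoff described above might look like this Python sketch (the names are hypothetical, and the JavaScript equivalent inside a Function would be analogous):

```python
import time

def retry_linear(operation, attempts=5, delay_seconds=5,
                 log=print, sleep=time.sleep):
    """Linear backoff: try up to `attempts` times with a fixed sleep in between,
    logging each transient failure so it shows up in your telemetry."""
    for retry_count in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Mirrors the context.log.error(...) call above: structured details
            # make the retries queryable in App Insights.
            log({"message": f"Transient failure. This is retry number {retry_count}.",
                 "errorDetails": repr(exc)})
            if retry_count == attempts:
                raise  # out of attempts: let the failure surface
            sleep(delay_seconds)
```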

How do Azure Functions scale out?

The scaling documentation for Azure Functions is a bit light on details for how Azure Functions decide when to add more instances of an app.
Say for example I have a function that is triggered by a Github webhook. 10,000 people simultaneously commit to the Github repo (with no merge conflicts ;) ), and Github calls my function 10,000 times in a very short period of time.
What can I expect to happen? Specifically,
Will Azure Functions throttle the webhook calls? i.e., will Azure Functions reject certain function calls if my function app is under high load?
Does Azure Functions queue the requests somehow? If so, where/how?
How many instances of my function app will Azure Functions create in this scenario? One for each request (i.e., 10,000), and each will run in parallel?
If my app function was scaled down to zero instances, because there was no load on it, could I expect to see some "warm-up time" before the first function is executed? Roughly how long?
Azure Functions won't reject a webhook call, but in the case of sudden, extreme load, some requests may time out. For web APIs, please include retries on the client as a best practice.
They aren't queued in any persistent place. They are (implementation detail) managed by IIS.
(Implementation detail) Number of instances isn't a hard set thing. We have certain, unpublished protections in place, but we're designed to scale quite far. Your requests will be handled by multiple instances.
Yes. Right now it's pretty hefty (seconds), but we'll be working to improve it. For perf-sensitive situations, a canary or a timer trigger to keep it awake is recommended.
I'm from the Azure Functions team. The things I marked as implementation details aren't promises and will likely also change as we evolve our service; just an attempt at transparency.
Tested today; it took a lot more than seconds :(
ACTUAL PERFORMANCE
--------------
ClientConnected: 13:58:41.589
ClientBeginRequest: 13:58:41.592
GotRequestHeaders: 13:58:41.592
ClientDoneRequest: 13:58:41.592
Determine Gateway: 0ms
DNS Lookup: 65ms
TCP/IP Connect: 40ms
HTTPS Handshake: 114ms
ServerConnected: 13:58:41.703
FiddlerBeginRequest: 13:58:41.816
ServerGotRequest: 13:58:41.817
ServerBeginResponse: 14:00:36.790
GotResponseHeaders: 14:00:36.790
ServerDoneResponse: 14:00:36.790
ClientBeginResponse: 14:00:36.790
ClientDoneResponse: 14:00:36.790
Overall Elapsed: **0:01:55.198**

How frequently does Traffic Manager monitor endpoints?

How frequently does Traffic Manager monitor endpoints? It's very obvious that it's not event-driven (when an endpoint is down, it takes up to 30 seconds to 2.5 minutes to identify the status of the endpoint, as per my observations). Can we configure this frequency? I cannot see any configuration for this.
Is there a relationship between Traffic Manager Monitoring interval and TTL?
This may look like a general question, but my real issue is that I experience service downtime in a failover scenario (failover of the primary). I understand the effect of TTL, where clients keep calling the cached endpoint until their DNS cache expires. I spent a lot of time on this, and have now narrowed it down to a specific question.
The issue is that there is a delay in Traffic Manager identifying the endpoint status after it's stopped or started. I need a logical explanation for this, but could not find any Azure reference which explains it.
Traffic manager settings
I need to understand this delay and plan for that down time.
I have gone through the same issue. Check this link; it explains the monitoring behaviour:
Traffic Manager Monitoring
The monitoring system performs a GET, but does not receive a response in 10 seconds or less. It then performs three more tries at 30 second intervals. This means that at most, it takes approximately 1.5 minutes for the monitoring system to detect when a service becomes unavailable. If one of the tries is successful, then the number of tries is reset. Although not shown in the diagram, if the 200 OK message(s) come back more than 10 seconds after the GET, the monitoring system will still count this as a failed check.
This explains the 30-second to 2-minute delay you observed.
Basically, the maximum delay would be approximately 1.5 minutes plus the TTL, as per the details above.
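Putting the quoted numbers together, the worst-case time before clients stop hitting a dead endpoint is roughly the monitoring detection window plus the DNS TTL clients may still be caching. A small Python sketch of that arithmetic (the 300-second TTL is just an example value, not a Traffic Manager default):

```python
def worst_case_failover_seconds(probe_timeout=10, retry_interval=30,
                                retries_after_first_failure=3, dns_ttl=300):
    """Approximate worst-case failover time per the quoted monitoring behaviour:
    the first probe can take up to `probe_timeout` to fail, followed by three
    more tries at `retry_interval` spacing; clients may then keep using the
    stale answer for up to `dns_ttl` more seconds."""
    detection = probe_timeout + retries_after_first_failure * retry_interval
    return detection, detection + dns_ttl

print(worst_case_failover_seconds())  # (100, 400) with a 300 s TTL
```

The 100-second detection window matches the "approximately 1.5 minutes" in the quoted documentation; planning downtime around detection plus TTL explains the delays observed in the question.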

Azure Table Storage RetryPolicy questions

A couple questions on using RetryPolicy with Table Storage,
Is it best practice to use RetryPolicy whenever you can, and hence use ctx.SaveChangesWithRetries() instead of ctx.SaveChanges() wherever possible?
When you do use RetryPolicy, for example,
ctx.RetryPolicy = RetryPolicies.Retry(5, TimeSpan.FromSeconds(1));
What values do people normally use for the retryCount and the TimeSpan? I see 5 retries and a 1-second TimeSpan are a popular choice, but would 5 retries at 1 second each be too long?
Thank you,
Ray.
I think this is highly dependent on your application and requirements. Timeout errors to ATS happen so rarely that a retry policy will not hurt to have in place, and would rarely be utilized anyway. But if something fishy is happening, it may save you from having to debug weird errors.
Now, I would suggest that in the beginning you do not enable the RetryPolicy at all and have tracing instead, so that you can see any issues with persistence to ATS. Once you've stabilized, putting a RetryPolicy in place may be a good idea to work around runtime glitches on the ATS side. Just make sure you're not masking your own problems with a RetryPolicy.
If your client is user-facing, like a web page, you would probably like to use a linear retry with short waits (milliseconds) between each retry. If your client is a non-user-facing backend service, then you would most likely want to use exponential retries in order not to overload the Table Storage service in case it is already returning 5xx errors due to high load, for instance.
Using the latest Azure Storage client SDK, if you do not define any retry policy in your table requests via TableRequestOptions, then the default retry policy is used, which is the exponential retry. The SDK makes 3 retries in total for the errors it deems retriable, and this takes roughly 20 seconds in total if all retries fail.
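For intuition, an exponential schedule roughly matching that observation (three retries totaling around 20 seconds) might look like the sketch below. The base delay here is illustrative, chosen to fit the quoted total; it is not the SDK's actual parameterization, which also adds random jitter:

```python
def exponential_retry_delays(attempts=3, base_seconds=3.0):
    """Shape of an exponential retry schedule: the delay before retry n
    is base * 2^(n-1), so successive waits double."""
    return [base_seconds * 2 ** i for i in range(attempts)]

delays = exponential_retry_delays()
print(delays, sum(delays))  # [3.0, 6.0, 12.0] 21.0
```

This doubling is what protects an already-overloaded service: each failed attempt buys the backend twice as much breathing room as the last.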
