A couple questions on using RetryPolicy with Table Storage,
Is it best practice to use RetryPolicy whenever you can, hence use ctx.SaveChangeWithRetries() instead of ctx.SaveChanges() accordingly whenever you can?
When you do use RetryPolicy, for example,
ctx.RetryPolicy = RetryPolicies.Retry(5, TimeSpan.FromSeconds(1));
What values do people normally use for the retryCount and the TimeSpan? I see 5 retries and 1 second TimeSpan are a popular choice, but would 5 retries 1 second each be too long?
Thank you,
Ray.
I think this is highly dependent on your application and requirements. The timeout errors to ATS happen so rarely that if a retry policy will not hurt to have in place and would be rarely utilized anyway. But if something fishy is happening, it may save yourself from having to debug weird errors.
Now, I would suggest that in the beginning you do not enable the RetryPolicy at all and have tracing instead so that you can see any issues with persistence to ATS. Once you're stabilized, putting a RetryPolicy maybe good idea to work around some runtime glitches on the ATS side. Just make sure you're not masking your own problems with RetryPolicy.
If your client is user facing like a web page you would probably like to use a linear retry with short waits (milliseconds) in between each retry, if your client is actually a non user facing backend service etc. then you would most likely want to use Exponential retries in order not to overload the table storage service in case it is already giving 5xx errors due to high load for instance.
Using the latest Azure Storage client SDK, if you do not define any retry policy in your table requests via the TableRequestOptions, then the default retry policy is used which is the Exponential retry. The sdk makes 3 retries in total for the errors that it deems retriable and this in total takes more or less 20 seconds if all retries fail.
Related
I'm using event hubs to temporary store data which will first be saved to azure table storage and then indexed to elasticsearch.
I was thinking that I should do the storage saving calls in an azure function, and do the same for the elasticsearch indexing using NEST.
It is important that the data is processed, so I was thinking that I'll use Polly as a retry policy in case the elasticsearch server is failing. However, won't a retry policy potentially make the azure function expensive?
Is azure functions even the right way to go?
Yes, you can use Polly for retries inside your Azure Functions. Some further considerations:
Yes, you will pay for the retry time. But given that your Elastic Search is "mostly up", the extra price for occasional retries should not be too high.
If you want to retry saving to Table Storage too, you will have to write calls decorated with Polly yourself instead of otherwise preferred output binding
Make sure to check if order of writes is important to you and whether you should retry Table Storage writes to completion before you start writing to Elastic, or vice versa. Otherwise you can do them in parallel with async and then Task.WaitAll
The maximum execution time of a Function is 5 minutes by default, you can configure it up to 10 minutes max. If you need to handle outages longer than that, you probably need a plan B. E.g. start copying the events that are failing for longer than 4 (or 9) minutes to a dedicated Queue, and retry from there. Or disabling the Function for such periods of downtime.
Yes it is. You could use a library or better just write a simple linear backoff strategy —
like try 5 times with 5 seconds sleep in between — and do something like
context.log.error({
message: `Transient failure. This is Retry number ${retryCount}.`,
errorCode: errorCodeFromCallingElasticSearch,
errorDetails: moreContextMaybeSomeStack
});
every time you hit the retry logic so it goes to App Insights (make sure you integrate with App Insights, else you have no ops or it's completely dark ops).
You can then query for how often is it really a miss and get an idea on how well things go at the 95% percentile.
Occasionally running 10 seconds over the normal 1 second execution time for your function is going to cost extra, but probably nowhere near a full dedicated App Service Plan. If it comes close, just switch to that, it means your function is mostly on rather than off - which is still a perfectly good case for running a function.
App Insights can also trigger alerts if some metric goes haywire, like your retry count goes up to 11 for 24 hours, you probably want to know about that deviation. You'll need to send the retry count as a custom metric to trigger an alert off of it:
context.log.metric("CallElasticSearchRetryCount", retryCount);
I have a document db database on azure. I have a particularly heavy query that happens when I archive a user record and all of their data.
I was on the S1 plan and would get an exception that indicated I was hitting the limit of RU/s. The S1 plan has 250.
I decided to switch to the Standard plan that lets you set the RU/s and pay for it.
I set it to 500 RU/s.
I did the same query and went back and looked at the monitoring chart.
At the time I did this latest query test it said I did 226 requests and 10 were throttled.
Why is that? I set it to 500 RU/s. The query had failed, by the way.
Firstly, Requests != Request Units, so your 226 requests will at some point have caused more than 500 Request Units to be needed within one second.
The DocumentDb API will tell you how many RUs each request costs, so you can examine that client side to find out which request is causing the problem. From my experience, even a simple by-id request often cost at least a few RUs.
How you see that cost is dependent on which client-side SDK you use. In my code, I have added something to automatically log all requests that cost more than 10 RUs, just so I know and can take action.
It's also the case that the monitoring tools in the portal are quite inadequate and I know the team are working on that; you can only see the total RUs for every five minute interval, but you may try to use 600 RUs in one second and you can't really see that in the portal.
In your case, you may either have a single big query that just costs more than 500 RU - the logging will tell you. In that case, look at the generated SQL to see why, maybe even post it here.
Alternatively, it may be the cumulative effect of lots of small requests being fired off in a small time window. If you are doing 226 requests in response to one user action (and I don't know if you are) then you probably want to reconsider your design :)
Finally, you can retry failed requests. I'm not sure about other SDKs but the .Net SDK retries a request automatically 9 times before giving up (that might be another explanation for the 229 requests hitting the server).
If your chosen SDK doesn't retry, you can easily do it yourself; the server will return a specific status code (I think 429 but can't quite remember) along with an instruction on how long to wait before retrying.
Please examine the queries and update your question so we can help further.
I am using the Datastax Cassandra driver and have a RetryPolicy setup to retry when a host is unavailable. However, I have noticed that it retries as fast as it can. I would like to change it to have an increasing delay between retries rather than hammer the cluster if it is struggling. This is particularly important for OVERLOADED request errors since I do want to retry in these scenarios, but with a substantial delay.
Where is the right place to put a delay and what is the right mechanism? Should I just throw a Thread.sleep(...) in my RetryPolicy?
I don't mind taking up a request on-the-wire slot (towards the maximum number of in-flight requests) but I am not okay with completely blocking other writes if we are not yet at the in-flight request limit.
You can implement your own retry policy by adding a delay. The simplest way is to pick the source code of the default retry and modify it yourself to implement an exponential delay for retry or something similar.
For exponential delay, just look at the source code of http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/ExponentialReconnectionPolicy.html to see how it works
In my experience working with DynamoDB and its provisioned throughput, the limits often are hit in normal usage. To work around this, I have used retry approaches such as Polly transient exception handling to simplify retry logic.
Does anyone know if there is any mechanism in ServiceStack to account for DynamoDB throughput limits in the current release of ServiceStack.AWS?
Yes all PocoDynamo API's are executed within a managed context where temporary errors are automatically retried using Amazons recommended exponential backoff.
The retry Exception Types are defined on PocoDynamo client which defaults to:
RetryOnErrorCodes = new HashSet<string> {
"ThrottlingException",
"ProvisionedThroughputExceededException",
"LimitExceededException",
"ResourceInUseException",
};
Today at a customer we analysed the logs of the previous weeks and we found the following issue regarding Windows Azure Service Bus Queues:
The request was terminated because the entity is being throttled.
Please wait 10 seconds and try again.
After verifying the code I told them to use the Transient Fault Handing Application Block (TOPAZ) to implement a retry policy like this one:
var retryStrategy = new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2));
var retryPolicy = new RetryPolicy<ServiceBusTransientErrorDetectionStrategy>(retryStrategy);
The customer answered:
"Ah that's great, so it will also handle the fact that it should wait
for 10 seconds when throttled."
Come to think about it, I never verified if this was the case or not. I always assumed this was the case. In the Microsoft.Practices.EnterpriseLibrary.WindowsAzure.TransientFaultHandling assembly I looked for code that would wait for 10 seconds in case of throttling but didn't find anything.
Does this mean that TOPAZ isn't sufficient to create resilient applications? Should this be combined with some custom code to handle throttling (ie: wait 10 seconds in case of a specific exception)?
As far as throttling concerned, Topaz provides a set of built-in retry strategies, including:
- Fixed interval
- Incremental intervals
- Random exponential back-off intervals
You can also write your custom retry stragey and plug-it into Topaz.
Also, as Brent indicated, 10 sec wait is not mandatory. In many cases, retrying immediately may succeed without the need to wait. By default, Topaz performs the first retry immediately before using the retry intervals defined by the strategy.
For more info, see Ch.6 of the "Building Elastic and Resilient Cloud Apps" Developer's Guide, also available as epub/mobi/pdf from here.
If you have suggestions/feature requests for Topaz, please submit them via the uservoice.
As I recall, the "10 second" wait isn't a requirement. Additionally, TOPAZ I believe also has backoff capabilities which would help you over come thing.
On a personal note, I'd argue that simply utilzing something like TOPAZ is not sufficient to creating a truely resilient solution. Resiliency goes beyond just throttling on a single connection point, you'll also need to be able to handle failover to a redundant endpoint which TOPAZ won't do.