Cosmos DB: How to retry failures with TransactionalBatch

I have a few stored procedures in Cosmos DB that I'd like to convert to .NET transactions. Recently, I saw this post https://devblogs.microsoft.com/cosmosdb/introducing-transactionalbatch-in-the-net-sdk/ that goes over transaction support. I was also able to test it, and it seems to be working fine.
I know that .NET has added built-in retry logic to many of its supported packages. Does TransactionalBatch have a built-in retry policy? What is the recommended approach to retrying failures? The post above checks IsSuccessStatusCode. Should we retry whenever that status indicates failure?

Does TransactionalBatch have any built-in retry policy?
For now, it has no built-in retry policy.
What is the recommended approach to retrying any failures?
TransactionalBatch describes a group of point operations that must all succeed or all fail together. If any operation fails, the entire transaction is rolled back.
A failed batch reports status codes such as 424 (Failed Dependency) and 409 (Conflict), so RetryOptions.MaxRetryAttemptsOnThrottledRequests does not help here: that setting only covers throttled requests.
So you could wrap the batch execution in a loop such as for (int i = 0; i < MaxRetries; i++) { ... } and implement the retry logic yourself.
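The loop suggested above could look roughly like the following. This is a minimal sketch, not SDK code: BatchRetry, executeBatch and shouldRetry are hypothetical names, and executeBatch stands in for building a fresh TransactionalBatch, executing it, and returning the StatusCode of the response. Which status codes are worth retrying is left to the caller, since a 409 caused by a duplicate id will usually fail again.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

public static class BatchRetry
{
    // Runs executeBatch up to maxAttempts times and returns the last status code.
    // executeBatch is a hypothetical delegate: it should build and execute a new
    // batch on every call and return the status code of the batch response.
    public static async Task<HttpStatusCode> ExecuteWithRetryAsync(
        Func<Task<HttpStatusCode>> executeBatch,
        Func<HttpStatusCode, bool> shouldRetry,
        int maxAttempts,
        TimeSpan delay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        HttpStatusCode status = default;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            status = await executeBatch();
            bool success = (int)status >= 200 && (int)status <= 299;

            // Stop on success, on a status the caller does not want retried,
            // or when the attempts are used up.
            if (success || !shouldRetry(status) || attempt == maxAttempts)
            {
                return status;
            }
            await Task.Delay(delay);
        }
        return status; // Not reached: the loop always returns on the last attempt.
    }
}
```

A caller might pass s => (int)s == 424 || (int)s == 429 as shouldRetry and treat any other failure as final.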

Related

Bulk delete in Azure CosmosDB leads to 429 error

I've implemented bulk deletion as recommended with the newer SDK: I created a list of tasks, one to delete each item, and then awaited them all. My CosmosClient was configured with AllowBulkExecution = true. As I understand it, the new SDK then does its magic under the hood and performs the work as a bulk operation.
Unfortunately, I've encountered a 429 response status, meaning my requests hit the request rate limit (it is low, a development-only tier, but nonetheless). I wonder how a single bulk operation can cause a 429 error, and how to implement bulk deletion in a way that is not "per item".
UPDATE: I use Azure Cosmos DB .NET SDK v3 for SQL API with bulk operations support as described in this article https://devblogs.microsoft.com/cosmosdb/introducing-bulk-support-in-the-net-sdk/

You need to handle 429s for deletes the same way you would for any other operation: wrap the call in an exception-handling block, check for the 429 status code, read the retry-after value from the response header, then sleep for that long and retry.
P.S. If you're trying to delete all the data in the container, it can be more efficient to delete and then recreate the container.
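The 429 handling described in this answer can be sketched as follows. RequestFailedException is a hypothetical stand-in so the example runs on its own; in the .NET SDK v3 the corresponding exception is, as far as I know, CosmosException, which exposes StatusCode and RetryAfter. The fallbackDelay parameter is an assumption for the case where no retry-after hint is present.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for an SDK exception that carries the HTTP status
// code and the server's retry-after hint.
public sealed class RequestFailedException : Exception
{
    public RequestFailedException(int statusCode, TimeSpan? retryAfter)
        : base($"Request failed with status {statusCode}")
    {
        StatusCode = statusCode;
        RetryAfter = retryAfter;
    }

    public int StatusCode { get; }
    public TimeSpan? RetryAfter { get; }
}

public static class ThrottleRetry
{
    // Runs the operation, and on a 429 waits for the suggested time before
    // trying again, at most maxRetries times. Any other failure, or a 429
    // after the last retry, propagates to the caller.
    public static async Task RunAsync(Func<Task> operation, int maxRetries, TimeSpan fallbackDelay)
    {
        for (int retry = 0; ; retry++)
        {
            try
            {
                await operation();
                return;
            }
            catch (RequestFailedException ex) when (ex.StatusCode == 429 && retry < maxRetries)
            {
                await Task.Delay(ex.RetryAfter ?? fallbackDelay);
            }
        }
    }
}
```

With bulk deletes, each item's delete task would go through a wrapper like this, so a throttled item waits and retries without failing the whole set.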

MongoError: WriteConflict error: this operation conflicted with another operation. Please retry your operation or mult

I am trying to update a document using two parallel multi-document transactions and I get the following error:
MongoError: WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction.
How can I fix this?

WriteConflict errors occur in MongoDB when two or more write operations try to modify the same document at the same time. MongoDB uses optimistic concurrency control: for a standalone write it retries the conflicting operation internally, but inside a multi-document transaction the later operation fails with WriteConflict and the transaction has to be retried.
Transactions in MongoDB can be implemented in two ways:
- Core API: the retry logic is not implemented internally but left for the developer to add.
- Callback API: the retry logic is already built in.
I believe you are using the Core API approach, which is why you see this error. Try switching to the Callback API approach to solve it.
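To illustrate what "retry logic built in" means, here is a rough sketch of the kind of loop you would have to write yourself with the Core API. It is not driver code: LabeledException and TransactionRetry are hypothetical stand-ins, and the time limit is an assumed parameter. The "TransientTransactionError" label is the one MongoDB attaches to errors, such as write conflicts, after which the whole transaction may be retried.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical stand-in for a driver exception that carries error labels.
public sealed class LabeledException : Exception
{
    private readonly HashSet<string> labels;

    public LabeledException(string message, params string[] labels) : base(message)
    {
        this.labels = new HashSet<string>(labels);
    }

    public bool HasErrorLabel(string label) => labels.Contains(label);
}

public static class TransactionRetry
{
    // Reruns the whole transaction body while it fails with a transient label
    // and the time budget has not run out. Other errors propagate immediately.
    public static T Run<T>(Func<T> transactionBody, TimeSpan timeLimit)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                return transactionBody();
            }
            catch (LabeledException ex)
                when (ex.HasErrorLabel("TransientTransactionError") && clock.Elapsed < timeLimit)
            {
                // A real implementation would abort the failed transaction here
                // before the next iteration starts a new one.
            }
        }
    }
}
```

The important point is that the entire transaction body is rerun, not just the write that conflicted.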

DynamoDB for ServiceStack 4.0.48

In my experience working with DynamoDB and its provisioned throughput, the limits often are hit in normal usage. To work around this, I have used retry approaches such as Polly transient exception handling to simplify retry logic.
Does anyone know if there is any mechanism in ServiceStack to account for DynamoDB throughput limits in the current release of ServiceStack.AWS?

Yes, all PocoDynamo APIs are executed within a managed context where temporary errors are automatically retried using Amazon's recommended exponential backoff.
The error codes that trigger a retry are defined on the PocoDynamo client, which defaults to:
RetryOnErrorCodes = new HashSet<string> {
    "ThrottlingException",
    "ProvisionedThroughputExceededException",
    "LimitExceededException",
    "ResourceInUseException",
};
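For readers who want to see the idea behind this, the following is a small sketch of an error-code check combined with a capped exponential delay. It is not PocoDynamo's implementation: BackoffSketch, IsRetryable and DelayFor are hypothetical names, and the base and maximum delays are assumed parameters. Only the set of error codes is taken from the defaults quoted above.

```csharp
using System;
using System.Collections.Generic;

public static class BackoffSketch
{
    // Error codes copied from the defaults quoted above.
    public static readonly HashSet<string> RetryOnErrorCodes = new HashSet<string>
    {
        "ThrottlingException",
        "ProvisionedThroughputExceededException",
        "LimitExceededException",
        "ResourceInUseException",
    };

    // True when the error code is one of the temporary errors worth retrying.
    public static bool IsRetryable(string errorCode) =>
        errorCode != null && RetryOnErrorCodes.Contains(errorCode);

    // Delay before retry number `retry` (1-based): baseDelay doubled on each
    // further retry, capped at maxDelay.
    public static TimeSpan DelayFor(int retry, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (retry < 1) throw new ArgumentOutOfRangeException(nameof(retry));
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, retry - 1);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }
}
```

With a base of 100 ms and a cap of 1 second, the delays for retries 1 to 5 are 100, 200, 400, 800 and 1000 ms. Production implementations usually add random jitter on top so that many clients do not retry in lockstep.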

Windows Azure Service Bus Queues: Throttling and TOPAZ

Today at a customer we analysed the logs of the previous weeks and we found the following issue regarding Windows Azure Service Bus Queues:
The request was terminated because the entity is being throttled.
Please wait 10 seconds and try again.
After verifying the code I told them to use the Transient Fault Handling Application Block (TOPAZ) to implement a retry policy like this one:
var retryStrategy = new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2));
var retryPolicy = new RetryPolicy<ServiceBusTransientErrorDetectionStrategy>(retryStrategy);
The customer answered:
"Ah that's great, so it will also handle the fact that it should wait
for 10 seconds when throttled."
Come to think about it, I never verified if this was the case or not. I always assumed this was the case. In the Microsoft.Practices.EnterpriseLibrary.WindowsAzure.TransientFaultHandling assembly I looked for code that would wait for 10 seconds in case of throttling but didn't find anything.
Does this mean that TOPAZ isn't sufficient to create resilient applications? Should it be combined with some custom code to handle throttling (i.e., wait 10 seconds in case of a specific exception)?

As far as throttling is concerned, Topaz provides a set of built-in retry strategies, including:
- Fixed interval
- Incremental intervals
- Random exponential back-off intervals
You can also write your own custom retry strategy and plug it into Topaz.
Also, as Brent indicated, the 10-second wait is not mandatory. In many cases, retrying immediately may succeed without any need to wait. By default, Topaz performs the first retry immediately before using the retry intervals defined by the strategy.
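To make the difference between the three interval styles and the immediate first retry concrete, here is a small sketch that computes the waits each would produce. It is an illustration only, not Topaz code: RetryIntervals and all of its members are hypothetical, the retry index is assumed to be 0-based, and the random spread of plus or minus 20% is an assumption.

```csharp
using System;

public static class RetryIntervals
{
    // retry is 0-based: 0 is the first retry.
    public static TimeSpan Fixed(int retry, TimeSpan interval) => interval;

    // Grows by a constant increment on every retry.
    public static TimeSpan Incremental(int retry, TimeSpan initial, TimeSpan increment) =>
        initial + TimeSpan.FromTicks(increment.Ticks * retry);

    // Grows exponentially with a random spread of +/-20% around delta;
    // never below min and capped at max.
    public static TimeSpan RandomExponential(
        int retry, TimeSpan min, TimeSpan max, TimeSpan delta, Random random)
    {
        double spread = delta.TotalMilliseconds * (0.8 + 0.4 * random.NextDouble());
        double ms = min.TotalMilliseconds + (Math.Pow(2, retry) - 1) * spread;
        return TimeSpan.FromMilliseconds(Math.Min(ms, max.TotalMilliseconds));
    }

    // Builds the list of waits for retryCount retries. With fastFirstRetry the
    // first wait is zero and later waits come from the strategy unchanged.
    public static TimeSpan[] Schedule(int retryCount, bool fastFirstRetry, Func<int, TimeSpan> strategy)
    {
        var waits = new TimeSpan[retryCount];
        for (int i = 0; i < retryCount; i++)
        {
            waits[i] = (fastFirstRetry && i == 0) ? TimeSpan.Zero : strategy(i);
        }
        return waits;
    }
}
```

In this sketch, Schedule(5, true, i => RetryIntervals.Incremental(i, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2))) yields waits of 0, 3, 5, 7 and 9 seconds. None of them is tied to a 10-second throttling hint from the service, which is the gap the question points out.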
For more info, see Chapter 6 of the "Building Elastic and Resilient Cloud Apps" Developer's Guide, also available as epub/mobi/pdf.
If you have suggestions or feature requests for Topaz, please submit them via UserVoice.

As I recall, the "10 second" wait isn't a requirement. Additionally, I believe TOPAZ also has backoff capabilities, which would help you overcome this.
On a personal note, I'd argue that simply using something like TOPAZ is not sufficient for creating a truly resilient solution. Resiliency goes beyond handling throttling on a single connection point: you'll also need to handle failover to a redundant endpoint, which TOPAZ won't do.

Azure Table Storage RetryPolicy questions

A couple questions on using RetryPolicy with Table Storage,
Is it best practice to use RetryPolicy whenever you can, and hence to use ctx.SaveChangesWithRetries() instead of ctx.SaveChanges() whenever you can?
When you do use RetryPolicy, for example,
ctx.RetryPolicy = RetryPolicies.Retry(5, TimeSpan.FromSeconds(1));
What values do people normally use for the retryCount and the TimeSpan? I see 5 retries and 1 second TimeSpan are a popular choice, but would 5 retries 1 second each be too long?
Thank you,
Ray.

I think this is highly dependent on your application and requirements. Timeout errors against ATS (Azure Table Storage) happen so rarely that a retry policy will not hurt to have in place and would rarely be used anyway. But if something fishy is happening, it may save you from having to debug weird errors.
That said, I would suggest that in the beginning you do not enable the RetryPolicy at all and use tracing instead, so that you can see any issues with persistence to ATS. Once you've stabilized, adding a RetryPolicy may be a good idea to work around runtime glitches on the ATS side. Just make sure you're not masking your own problems with RetryPolicy.

If your client is user facing, like a web page, you would probably want a linear retry with short waits (milliseconds) between retries. If your client is a non-user-facing backend service, you would most likely want exponential retries, so as not to overload the table storage service when it is already returning 5xx errors, for instance due to high load.
Using the latest Azure Storage client SDK, if you do not define a retry policy in your table requests via TableRequestOptions, the default retry policy is used, which is exponential retry. The SDK makes 3 retries in total for the errors that it deems retriable, and this takes roughly 20 seconds in total if all retries fail.
