DefaultRetryPolicy - write timeout - cassandra

The documentation for DefaultRetryPolicy says that this policy retries queries in only two cases:
On a read timeout, if enough replicas replied but data was not retrieved.
On a write timeout, if we timeout while writing the distributed log used by batch statements.
This retry policy is conservative in that it will never retry with a different consistency level than the one of the initial operation.
Does this mean that when I do a simple session.execute(BoundStatement) without any custom retry policy and get a write timeout, the default retry policy will kick in and the write will be retried? And what does the "distributed log used by batch statements" mean?

If you don't specify a retry policy, the driver will use DefaultRetryPolicy.
By default, retry on a write timeout applies only to logged batch operations (a logged batch enforces atomicity). The "distributed log used by batch statements" is Cassandra's batch log, which logged batches write to before the individual statements are applied.
No retry will happen on a write timeout for a non-batch operation.
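For reference, the decision logic in question looks roughly like this. This is a simplified, illustrative re-implementation of the 3.x Java driver's DefaultRetryPolicy, not the verbatim source, and the class name is made up:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.DriverException;
import com.datastax.driver.core.policies.RetryPolicy;

public class DefaultLikeRetryPolicy implements RetryPolicy {

    @Override
    public RetryDecision onReadTimeout(Statement stmt, ConsistencyLevel cl, int requiredResponses,
                                       int receivedResponses, boolean dataRetrieved, int nbRetry) {
        if (nbRetry != 0)
            return RetryDecision.rethrow();
        // Case 1: enough replicas replied but the data itself was not retrieved -> retry once.
        return (receivedResponses >= requiredResponses && !dataRetrieved)
                ? RetryDecision.retry(cl)
                : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onWriteTimeout(Statement stmt, ConsistencyLevel cl, WriteType writeType,
                                        int requiredAcks, int receivedAcks, int nbRetry) {
        if (nbRetry != 0)
            return RetryDecision.rethrow();
        // Case 2: the only write timeout that is retried is the batch-log write
        // (WriteType.BATCH_LOG). A plain INSERT/UPDATE from session.execute(BoundStatement)
        // has WriteType.SIMPLE, so its write timeout is rethrown to the application.
        return writeType == WriteType.BATCH_LOG
                ? RetryDecision.retry(cl)
                : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onUnavailable(Statement stmt, ConsistencyLevel cl, int requiredReplica,
                                       int aliveReplica, int nbRetry) {
        // Not one of the two documented cases: try one other host, then give up.
        return nbRetry == 0 ? RetryDecision.tryNextHost(cl) : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onRequestError(Statement stmt, ConsistencyLevel cl, DriverException e, int nbRetry) {
        return RetryDecision.tryNextHost(cl);
    }

    @Override public void init(Cluster cluster) { }
    @Override public void close() { }
}
```

In other words, a write timeout on a plain bound statement is not retried by the default policy; the retry only covers a timeout while writing the batch log.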

Related

Azure Event Hub - How to achieve infinite retry?

An Event Hub consumer needs to keep processing a received message until it succeeds despite transient faults. How can I achieve this infinite retry while honoring the Event Hub partition lease expiry?
The business scenario here is not important; what I'm looking for is the approach for infinite retry (taking partition lease expiry into account).
Note: I'm reading messages in batches, and processing any message can encounter transient faults that need to be retried. So driving the logic off an 'offset' value may not be efficient, but I'm not sure whether anyone has achieved infinite retries by leveraging the offset value.
The consumer can retry on transient failures indefinitely until cancellation is requested. By the way, the lease won't expire just because a retry takes longer than expected.
Please check the API documentation for more details: https://learn.microsoft.com/en-us/dotnet/api/azure.messaging.eventhubs.processor.processeventargs?view=azure-dotnet
CancellationToken
A CancellationToken to indicate that the processor is requesting that the handler stop its activities. If this token is requesting cancellation, then either the processor is attempting to shutdown or ownership of the partition has changed.
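The types above are from the .NET SDK. Purely as a sketch of the retry-until-cancelled pattern this answer describes (the names below are illustrative stand-ins, not Azure SDK types), the handler body would look something like this:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class RetryUntilCancelled {

    // Stand-in for the cancellation signal the processor hands to the event handler.
    private final AtomicBoolean cancellationRequested = new AtomicBoolean(false);

    // Hypothetical per-event processing step that may fail transiently.
    private void processOnce(String eventBody) throws Exception {
        // ... business logic ...
    }

    public void handleEvent(String eventBody) throws InterruptedException {
        long backoffMillis = 100;
        // Retry indefinitely on transient failures, but stop as soon as cancellation
        // is requested (processor shutting down or partition ownership changed).
        while (!cancellationRequested.get()) {
            try {
                processOnce(eventBody);
                return; // success
            } catch (Exception transientFault) {
                Thread.sleep(backoffMillis);
                backoffMillis = Math.min(backoffMillis * 2, 30_000); // capped exponential backoff
            }
        }
    }
}
```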

Timeout error when creating ServiceBusMessageBatch in Azure.Messaging.ServiceBus

I have the following code, where I start getting an error during long-running tests on the same Service Bus client:
ServiceBusMessageBatch batch = this._serviceBusSender.CreateMessageBatchAsync().GetAwaiter().GetResult();
The error is,
Azure.Messaging.ServiceBus.ServiceBusException: 'The operation did not complete within the allocated time 00:01:00 for object request42. (ServiceTimeout)'
Why is this statement throwing this error? Is the creation of a batch object such a heavy operation that it can even time out? If that is the case, should I switch to the overload that takes a List of ServiceBusMessage instead of this batch mode?
My understanding is that this way of creating a batch protects me from building a batch that the queue may not allow. I am finding it difficult to understand why it times out after 1 minute.
In order for a batch to be able to enforce limits on the size, it has to establish an AMQP link to the entity that you'll be sending to and read the maximum allowable message size from the service. This results in a network operation that, in this case, timed out. This overhead is performed only in the case that there is not an existing AMQP link already established - typically on the first call that requires a network operation.
What jumps out at me from your code is the use of GetAwaiter().GetResult() to perform sync-over-async. This is really not a good idea and is very likely to cause contention in the thread pool that prevents continuations from being scheduled in a timely manner. Because network operations in Service Bus are asynchronous - including establishing the AMQP link - delays in scheduling continuations would certainly increase the chance of timeouts.
I'd strongly advise refactoring your sync-over-async code paths and shifting to an asynchronous approach. In those scenarios where it's not possible to go full async, limiting sync-over-async to the outermost layer of your code would be the next best thing.

Does Cassandra log query attempts that are part of a retry policy?

For example, would these attempts be recorded as part of a trace session in system_traces.sessions or system_traces.events?
Edit: The driver I'm using is called gocql
In the Java driver, there is a logging retry policy which can act as a parent policy for another retry policy; it should log the decision to retry.
In the gocql driver, though, looking at the query executor, I cannot see any explicit logging around retries. Only one of the retry mechanisms appears to log anything, the DowngradingConsistencyRetryPolicy: if debug is enabled, it will log the consistency downgrade.
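For completeness, wiring up that parent logging policy in the Java driver looks like this (a minimal sketch against the 3.x API; the contact point and query are placeholders). LoggingRetryPolicy logs each retry decision on the client side, in the application's own log, not in system_traces:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DefaultRetryPolicy;
import com.datastax.driver.core.policies.LoggingRetryPolicy;

public class LoggingRetryExample {
    public static void main(String[] args) {
        // Wrap the default policy so every retry decision is logged by the driver.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // placeholder contact point
                .withRetryPolicy(new LoggingRetryPolicy(DefaultRetryPolicy.INSTANCE))
                .build();
        try {
            Session session = cluster.connect();
            session.execute("SELECT release_version FROM system.local");
        } finally {
            cluster.close();
        }
    }
}
```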

What should be done when the provisioned throughput is exceeded?

I'm using the AWS SDK for JavaScript (Node.js) to read data from a DynamoDB table. The auto scaling feature does a great job most of the time, and the consumed Read Capacity Units (RCU) are really low for most of the day. However, there's a scheduled job executed around midnight which consumes about 10x the provisioned RCU, and since auto scaling takes some time to adjust the capacity, there are a lot of throttled read requests. Furthermore, I suspect my requests are not being completed (though I can't find any exceptions in my error log).
In order to handle this situation, I've considered increasing the provisioned RCU using the AWS API (updateTable) but calculating the number of RCU my application needs may not be straightforward.
So my second guess was to retry failed requests and simply wait for auto scaling to increase the provisioned RCU. As pointed out by the AWS docs and some Stack Overflow answers (particularly about ProvisionedThroughputExceededException):
The AWS SDKs for Amazon DynamoDB automatically retry requests that receive this exception. So, your request is eventually successful, unless the request is too large or your retry queue is too large to finish.
I've read similar questions (this one, this one and this one) but I'm still confused: is this exception raised if the request is too large or the retry queue is too large to finish (therefore after the automatic retries) or actually before the retries?
Most important: is that the exception I should be expecting in my context? (so I can catch it and retry until auto scale increases the RCU?)
Yes.
Every time your application sends a request that exceeds your capacity, you get a ProvisionedThroughputExceededException from DynamoDB. However, your SDK handles this for you and retries. The default DynamoDB retry delay starts at 50ms, the default number of retries is 10, and the backoff is exponential by default.
This means you get retries at:
50ms
100ms
200ms
400ms
800ms
1.6s
3.2s
6.4s
12.8s
25.6s
If after the 10th retry your request has still not succeeded, the SDK passes the ProvisionedThroughputExceededException back to your application and you can handle it how you like.
You could handle it by increasing the provisioned throughput, but another option would be to change the default retry settings when you create the DynamoDB client. For example:
new AWS.DynamoDB({maxRetries: 13, retryDelayOptions: {base: 200}});
This would mean you retry 13 times with an initial delay of 200ms, so the final retry is not attempted until roughly 819.2 seconds after the original request, rather than 25.6 seconds.
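The question uses the JavaScript SDK, but the catch-and-handle part looks the same in any SDK. As a rough sketch with the AWS SDK for Java v1 (the table name and key below are made up for the example):

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughputExceededException;

import java.util.Collections;

public class ThrottledReadExample {
    public static void main(String[] args) {
        // Raise the SDK's built-in retry count, analogous to maxRetries in the JS example.
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                .withClientConfiguration(new ClientConfiguration().withMaxErrorRetry(13))
                .build();

        GetItemRequest request = new GetItemRequest()
                .withTableName("my-table") // hypothetical table
                .withKey(Collections.singletonMap("id", new AttributeValue("123")));

        try {
            client.getItem(request);
        } catch (ProvisionedThroughputExceededException e) {
            // Only reached after the SDK's own retries are exhausted; at this point you
            // can back off further, re-queue the work, or alert, as fits your job.
        }
    }
}
```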

Datastax Cassandra Driver Retry Policy Delay?

I am using the Datastax Cassandra driver and have a RetryPolicy setup to retry when a host is unavailable. However, I have noticed that it retries as fast as it can. I would like to change it to have an increasing delay between retries rather than hammer the cluster if it is struggling. This is particularly important for OVERLOADED request errors since I do want to retry in these scenarios, but with a substantial delay.
Where is the right place to put a delay and what is the right mechanism? Should I just throw a Thread.sleep(...) in my RetryPolicy?
I don't mind taking up a request on-the-wire slot (towards the maximum number of in-flight requests) but I am not okay with completely blocking other writes if we are not yet at the in-flight request limit.
You can implement your own retry policy that adds a delay. The simplest way is to take the source code of the default retry policy and modify it yourself to implement an exponential delay between retries, or something similar.
For an exponential delay, look at the source code of http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/ExponentialReconnectionPolicy.html to see how it works.
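As a rough sketch of that suggestion for the 3.x driver: a wrapper that delegates every decision to DefaultRetryPolicy but sleeps with an exponentially growing delay before returning any retry decision. Note that Thread.sleep blocks the thread the driver invokes the policy on, which is exactly the trade-off raised in the question, so treat this as an illustration rather than a drop-in solution:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.DriverException;
import com.datastax.driver.core.policies.DefaultRetryPolicy;
import com.datastax.driver.core.policies.RetryPolicy;

public class BackoffRetryPolicy implements RetryPolicy {

    private static final long BASE_DELAY_MS = 100;
    private static final long MAX_DELAY_MS = 10_000;

    private final RetryPolicy delegate = DefaultRetryPolicy.INSTANCE;

    // Sleep before any RETRY decision (including try-next-host), then pass it through.
    private RetryDecision delay(RetryDecision decision, int nbRetry) {
        if (decision.getType() == RetryDecision.Type.RETRY) {
            long delay = Math.min(BASE_DELAY_MS * (1L << nbRetry), MAX_DELAY_MS);
            try {
                Thread.sleep(delay); // blocks the calling thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return decision;
    }

    @Override
    public RetryDecision onReadTimeout(Statement stmt, ConsistencyLevel cl, int required, int received,
                                       boolean dataRetrieved, int nbRetry) {
        return delay(delegate.onReadTimeout(stmt, cl, required, received, dataRetrieved, nbRetry), nbRetry);
    }

    @Override
    public RetryDecision onWriteTimeout(Statement stmt, ConsistencyLevel cl, WriteType writeType,
                                        int requiredAcks, int receivedAcks, int nbRetry) {
        return delay(delegate.onWriteTimeout(stmt, cl, writeType, requiredAcks, receivedAcks, nbRetry), nbRetry);
    }

    @Override
    public RetryDecision onUnavailable(Statement stmt, ConsistencyLevel cl, int requiredReplica,
                                       int aliveReplica, int nbRetry) {
        return delay(delegate.onUnavailable(stmt, cl, requiredReplica, aliveReplica, nbRetry), nbRetry);
    }

    @Override
    public RetryDecision onRequestError(Statement stmt, ConsistencyLevel cl, DriverException e, int nbRetry) {
        return delay(delegate.onRequestError(stmt, cl, e, nbRetry), nbRetry);
    }

    @Override public void init(Cluster cluster) { }
    @Override public void close() { }
}
```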
