ServiceBus RetryExponential Property Meanings

I'm having a hard time understanding the RetryExponential class that is used in conjunction with QueueClients (and I assume SubscriptionClients as well).
The properties are listed here, but I don't think my interpretation of their descriptions is correct.
Here's my interpretation...
var minBackoff = TimeSpan.FromMinutes(5); // wait 5 minutes for the first attempt?
var maxBackoff = TimeSpan.FromMinutes(15); // all attempts must be done within 15 mins?
var deltaBackoff = TimeSpan.FromSeconds(30); // the time between each attempt?
var terminationTimeBuffer = TimeSpan.FromSeconds(90); // the length of time each attempt is permitted to take?
var retryPolicy = new RetryExponential(minBackoff, maxBackoff, deltaBackoff, terminationTimeBuffer, 10);
My worker role has only attempted to process a message off the queue twice in the past hour, even though, based on the configuration above, I think it should go off more frequently (every 30 seconds plus whatever processing time was used during the previous attempt, up to 90 seconds). I assume these settings would force a retry every 2 minutes. However, I don't see how this interpretation is exponential at all.
Are my interpretations for each property (in comments above) correct? If not (and I assume they're not correct), what do each of the properties mean?

As you suspected, the values you included do not make sense for the meaning of these parameters. Here is my understanding of the parameters:
DeltaBackoff - the interval by which the retry interval grows, exponentially.
MaximumBackoff - the maximum amount of time you want between retries.
MaxRetryCount - the maximum number of times the system will retry the operation.
MinimalBackoff - the minimum amount of time you want between retries.
TerminationTimeBuffer - the maximum amount of time the system will retry the operation before giving up.
It will always retry up to the maxRetryCount, in your case 10, unless the terminationTimeBuffer limit is hit first.
It will also not retry for a period greater than the terminationTimeBuffer, which in your case is 90 seconds, regardless of whether it has hit the max retry count yet.
The minBackoff is the minimal amount of time you will wait between retries and maxBackoff is the maximum amount of time you want to wait between retries.
The DeltaBackOff value is the amount by which each retry interval grows, exponentially. Note that this isn't an exact time: it randomly chooses a time a little less or a little more than this, so that multiple retrying threads aren't all doing so at exactly the same moment; the randomness staggers them a little. Between the first actual attempt and the first retry there will be the minBackOff interval only. Since you set your deltaBackOff to 30 seconds, if it made it to a second retry, the wait would be roughly 30 seconds plus the minBackOff. The third retry would be 90 seconds plus the minBackOff, and so on for each retry until it hits the maximum backoff.
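To make that concrete, here's a rough sketch with values sized for quick retries (the parameter order matches the constructor in the question; the interval math ignores the jitter, so treat the numbers as approximate):

var retryPolicy = new RetryExponential(
    TimeSpan.FromSeconds(1),    // minBackoff: smallest wait before a retry
    TimeSpan.FromSeconds(30),   // maxBackoff: largest wait before a retry
    TimeSpan.FromSeconds(5),    // deltaBackoff: base amount the wait grows by
    TimeSpan.FromSeconds(90),   // terminationTimeBuffer: total time budget for retrying
    10);                        // maxRetryCount: hard cap on the number of retries

// Approximate waits, ignoring jitter:
// retry 1: ~1s  (minBackoff only)
// retry 2: ~1s + 5s
// retry 3: ~1s + 15s
// later retries: grow until capped at maxBackoff (30s)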
One thing I would make sure to point out is that this is a retry policy, meaning if an operation receives an exception it will follow this policy to attempt it again. If operations such as Retrieve, Deadletter, Defer, etc. fail then this retry policy is what will kick in. These are operations against service bus, not exceptions in your own processing.
I could be wrong on this, but my understanding is that this isn't directly tied to the actual receipt of a message for processing unless the call to receive fails. Continuous processing is handled through the Receive method and your own code loop, or through the OnMessage action (which behind the scenes also uses Receive). As long as there isn't an error actually attempting to receive, this retry policy doesn't get applied. The interval between calls to receive is set either by your own use of the Receive overload that takes a TimeSpan, or by setting MessagingFactory.OperationTimeout before creating the QueueClient object. If a receive call reaches its limit of waiting, either because you used the overload that provides a TimeSpan on Receive or because it hits the default, null is simply returned. This is not considered an exception, and therefore the retry policy won't kick in.
Sadly, I think you have to code your own exponential back-off for actual processing. There are tons of examples out there, though.
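For example, a minimal hand-rolled sketch over the synchronous Receive API (queueClient and ProcessMessage are assumed to exist already; this is an illustration, not a hardened implementation):

// using System.Threading;
// using Microsoft.ServiceBus.Messaging;
var backoff = TimeSpan.FromSeconds(1);       // initial wait when the queue is empty
var maxBackoff = TimeSpan.FromMinutes(2);    // cap on the wait

while (true)
{
    BrokeredMessage message = queueClient.Receive(TimeSpan.FromSeconds(5));
    if (message == null)
    {
        // An empty queue is not an exception, so the retry policy never fires;
        // back off ourselves, doubling the wait up to the cap.
        Thread.Sleep(backoff);
        backoff = TimeSpan.FromTicks(Math.Min(backoff.Ticks * 2, maxBackoff.Ticks));
        continue;
    }

    backoff = TimeSpan.FromSeconds(1);       // reset after a successful receive
    ProcessMessage(message);                 // your own processing (placeholder)
    message.Complete();
}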
And yes, you can set this retry policy on both QueueClient and SubscriptionClient.

Related

Occasional duplicate request using jmeter

I'm using JMeter 4.0 trying to create a stress test. The purpose is to emulate the types of requests we receive in production, which is generally an array of requests of different types with a certain frequency and occasionally (1 in 1000) duplicate requests of the same type within milliseconds of each other.
I've managed to create a thread group emulating frequent requests of different types and a second thread group emulating duplicate requests (using synchronizing timer to ensure the requests fire off together).
I'm almost finished. My only problem is that there is no relationship between the thread groups whatsoever. If I wanted to perform a duplicate request once every 1000 requests, I'd need to know how long it takes to perform an average request (which is complicated by the fact that there are several request types) and calculate the time it would require for roughly 1000 requests to be made, and add an appropriate constant timer in the other thread group.
This isn't ideal. I'll settle for this if I must, but I was hoping the bright minds of Stack Overflow could shed some light on my issue.
Some ideas I've had:
Add a run counter which cycles every 1000 normal requests; once the counter hits 1000, I perform a second request (though it would be under the same thread and after I've received the response from the first). Could this be made to work using a synchronizing timer?
Use a constant throughput timer with "all active threads (shared)" set, whose samples per minute is set to 1000.
Is there a better way still? The actual requests are HTTP requests, though there are several steps prior in preparation of the message to send. I'm already using a constant throughput timer in the first thread group (random service requests) to maintain a specific number of requests per minute, so I'm not sure if adding a second constant throughput timer in the other thread group would create issues.
Thank you for your time.
You can add an If Controller with a condition that is true for 1 in every 1000 threads
${__jexl3(${__threadNum} % 1000 == 0)}
and inside the If Controller execute your duplicate HTTP Request.
__threadNum returns the current thread/user number
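If each thread loops many times, the thread number won't track the request count; assuming the condition is evaluated once per iteration, an untested alternative using JMeter's global __counter function could be
${__jexl3(${__counter(FALSE,)} % 1000 == 0)}
so the If Controller fires roughly once per 1000 requests across all threads.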

Automatically renewing locks correctly on Azure Service Bus

So I'm trying to understand Service Bus timings, especially how the locks work. One can choose to manually call CompleteAsync, which is what we're doing. It could also be the case that the processing takes some time. In these cases we want to make sure we don't get unnecessary MessageLockLostExceptions.
Seems there are a couple of numbers to relate to:
Lock duration (found in the Azure portal on the bus; currently set to 1 minute, which I think is the default)
AutoRenewTimeout (property on OnMessageOptions, currently set to 1 minute)
AutoComplete (property on OnMessageOptions, currently set to false)
Assume the processing runs for around 2 minutes and then either succeeds or crashes (it doesn't matter which for now). Let's say this is the normal scenario, so processing takes roughly 2 minutes for each message.
Also, it's indeed a queue and not a topic, and we have only one consumer that asynchronously processes the messages with MaxConcurrentCalls set to 100. We're using OnMessageAsync with ReceiveMode.PeekLock.
What should my settings now be as a single consumer to robustly process all messages?
I'm thinking that leaving the lock duration at 1 minute would be fine, as that's the default, and setting my AutoRenewTimeout to 5 minutes for safety, because as I've understood it, this value should be the maximum time it takes to process a message (at least according to this answer). Performance is not critical for this system, so my reasoning is that leaving a message locked for an unnecessary 1, 2 or 3 minutes is not evil, as long as we don't get lock-lost exceptions, because those give no real value.
This thread and this thread give great examples of how to manually renew the locks, but I thought there was a way to automatically renew them.
What should my settings now be as a single consumer to robustly process all messages?
Aside from LockDuration, MaxConcurrentCalls, AutoRenewTimeout, and AutoComplete, there are some configurations of the Azure Service Bus client you might want to look into. For example, create not a single client with MaxConcurrentCalls set to 100, but a few clients with the total concurrency level distributed among them. Note that you'd want to use different MessagingFactory instances to create those clients, to ensure you have more than a single "pipe" to receive messages. And even with that, it would be far better to scale out and have competing consumers rather than a single consumer handling all the load.
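A rough sketch of that layout, assuming the older Microsoft.ServiceBus.Messaging API (connectionString, the queue name, and ProcessAsync are placeholders):

// using System.Collections.Generic;
// using Microsoft.ServiceBus.Messaging;
// Spread 100 concurrent calls over 4 clients, each created from its own
// MessagingFactory so each gets its own "pipe".
var clients = new List<QueueClient>();
for (var i = 0; i < 4; i++)
{
    var factory = MessagingFactory.CreateFromConnectionString(connectionString);
    var client = factory.CreateQueueClient("myqueue");   // placeholder queue name
    client.OnMessageAsync(ProcessAsync, new OnMessageOptions
    {
        MaxConcurrentCalls = 25,   // 4 x 25 = 100 total
        AutoComplete = false
    });
    clients.Add(client);
}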
Now back to the settings. If your normal processing time is 2 minutes, it's better to set MaxLockDuration on the entities to this time and not 1 minute. This will remove unnecessary lock extension calls to the broker and eliminate MessageLockLostException.
Also, keep in mind that AutoRenewTimeout is a client-side operation, not a broker one, and is therefore not guaranteed. You will run into cases where the lock is lost even though the AutoRenewTimeout period has not elapsed yet.
AutoRenewTimeout should always be set longer than MaxLockDuration, as having them equal would be counterproductive. Make it somewhat larger than MaxLockDuration: it is the client's "insurance" that the message lock won't be lost when processing takes longer than MaxLockDuration. Having the two equal, in essence, disables this fallback.
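Putting the advice together for the 2-minute scenario, one possible configuration sketch (the exact values are suggestions, and the entity's LockDuration is assumed to have been raised to ~2 minutes in the portal or via NamespaceManager):

var options = new OnMessageOptions
{
    AutoComplete = false,                         // we call CompleteAsync ourselves
    MaxConcurrentCalls = 100,
    AutoRenewTimeout = TimeSpan.FromMinutes(4)    // comfortably above the ~2-minute lock
};
queueClient.OnMessageAsync(async message =>
{
    await ProcessAsync(message);                  // the ~2 minutes of real work (placeholder)
    await message.CompleteAsync();
}, options);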

azure function max execution time

I would like to have a function called on a timer (every X minutes) but I want to ensure that only one instance of this function is running at a time. The work that is happening in the function shouldn't take long, but if for some reason it takes longer than the scheduled timer (X minutes) I don't want another instance to start and the processes to step on each other.
The simplest way that I can think of would be to set a maximum execution time on the function to also be X minutes. I would want to know how to accomplish this in both the App Service and Consumption plans, even if they are different approaches. I also want to be able to set this on an individual function level.
This type of feature is normally built into a FaaS environment, but I am having the hardest time google-binging it. Is this possible in function.json? Or are there other ways to make sure that this runs only once?
(PS. I know I could do this in my own code by wrapping the work in a thread with a timeout, but I was hoping for something more idiomatic.)
Timer functions already have this behavior - they take out a blob lease from the AzureWebJobsStorage storage account to ensure that only one instance is executing the timer function. Also, the timer will not execute while a previous scheduled execution is in flight.
Another roll-your-own possibility is to handle this with storage queues and visibility timeouts: when the queue-triggered function has finished processing, push a new queue message with a visibility timeout matching the desired schedule.
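A sketch of that variant with the classic WindowsAzure.Storage queue SDK (the queue and the 5-minute interval are placeholders):

// using Microsoft.WindowsAzure.Storage.Queue;
// After one round of work completes, schedule the next by enqueuing a message
// that stays invisible until the next desired run time.
var next = new CloudQueueMessage("tick");
queue.AddMessage(
    next,
    timeToLive: null,                                 // default TTL
    initialVisibilityDelay: TimeSpan.FromMinutes(5),  // becomes visible in 5 minutes
    options: null,
    operationContext: null);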
I want to mention that the functionTimeout host.json property will add a timeout to all of your functions, but has the side effect that your function will fail with a timeout error and that function instance will restart, so I wouldn't rely on it in this case.
You can specify the 'functionTimeout' property in host.json:
https://github.com/Azure/azure-webjobs-sdk-script/wiki/host.json
// Value indicating the timeout duration for all functions.
// In Dynamic SKUs, the valid range is from 1 second to 10 minutes and the default value is 5 minutes.
// In Paid SKUs there is no limit and the default value is null (indicating no timeout).
"functionTimeout": "00:05:00"
There is a new Azure Functions plan called Premium (in public preview as of May 2019) that allows for unlimited execution duration:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale
It will probably end up being the go-to plan for most enterprise scenarios.

How to set the number of retries for Azure DocumentDB output binding in Azure Function?

Based on this question, it seems that writing to the Azure DocumentDB output binding in an Azure Function will be retried 10 times if throttled (HTTP 429). I haven't verified this myself, though.
I would like to increase this limit on the number of retries. My data comes in big chunks in a small amount of time and then with a very long period of downtime, which means that getting 429 and waiting for a bit is okay for my purpose. I must guarantee though, that no data is dropped.
One way for me to solve this is to increase the RU limit in DocumentDB to make sure I don't get 429s while big chunks of data come in, but it's already at about 2.5 times what I need during the downtime period. Is there any way to make the retries run infinitely until they succeed, or, less ideally, to increase the number of retries to something more than 10?
Why don't you change the approach and, instead of inserting documents right away, make use of Service Bus and implement a dead-letter queue? Here are some links:
https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-dead-letter-queues
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-service-bus
https://blog.jeroenmaes.eu/2017/01/process-service-bus-dead-letter-message-with-azure-functions/
The idea is having something like this:
The current function, instead of saving the data in DocumentDB, sends it to the Service Bus (you just change the output binding).
Another function processes each message from the Service Bus; if processing fails, the message ends up in a dead-letter queue (you can manage a timeout in the function and move the message there yourself) — see the sketch below.
Another function processes any message in the dead-letter queue.
You just need to make a small change in the first function and create two more. It might sound too complicated, but you'll have strong consistency in the data. All of the links above include an example of what I mentioned here.
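For the middle function, a minimal sketch of the idea (v1-style C#; SaveToDocumentDbAsync is hypothetical): if the write throws, Service Bus redelivers the message, and once MaxDeliveryCount is exceeded the broker dead-letters it for the third function to pick up.

public static async Task Run(BrokeredMessage message, TraceWriter log)
{
    var payload = message.GetBody<string>();
    // May throw on a 429 from DocumentDB; the exception abandons the message,
    // so Service Bus redelivers it and eventually dead-letters it.
    await SaveToDocumentDbAsync(payload);
}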

Why does it take so long to get through first level retries?

I've just started playing around with NServiceBus on Azure, and for some reason it takes a long time to get through the first level retries when a message handler throws an exception. With retries set to 5 it takes 20+ minutes before the second level retries kick in.
What is causing the delay?
Here's how I'm configuring the bus:
Configure.Transactions.Advanced(s =>
{
    s.DisableDistributedTransactions();
    s.DoNotWrapHandlersExecutionInATransactionScope();
});

Configure.With()
    .AutofacBuilder(container)
    .DefiningCommandsAs(t => t.IsCommand())
    .DefiningEventsAs(t => t.IsEvent())
    .XmlSerializer()
    .MessageForwardingInCaseOfFault()
    .AzureConfigurationSource()
    .UseTransport<AzureStorageQueue>()
    .AzureDiagnosticsLogger()
    .AzureMessageQueue()
    .AzureSubcriptionStorage()
    .UseAzureTimeoutPersister()
    .UnicastBus()
    .RunHandlersUnderIncomingPrincipal(false);
FYI: I'm using NServiceBus built from the develop branch as of today and running in the emulator.
Oh, I misread the question. I thought it was taking 20 minutes after the last retry for the second level to kick in. But then I know what this is, and it's configurable!
To support batching (to lower the cost), the message's visibility time is calculated by multiplying the individual MessageInvisibleTime by the BatchSize. The default MessageInvisibleTime is 30000 milliseconds and the default BatchSize is 10. Multiply that again by 5 first-level retries and you end up with 25 minutes before the first-level retries are exhausted and the second level kicks in.
You can reconfigure this if you like: MessageInvisibleTime and BatchSize are properties on AzureQueueConfig, and MaxRetries sits on TransportConfig (in 4.0) or MsmqTransportConfig (in 3.x).
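If memory serves for the 3.x/4.x Azure transport (treat the exact section and type names as assumptions to verify against your NServiceBus version), the app.config change would look roughly like this, bringing 5 first-level retries down to well under two minutes:

<configSections>
  <section name="AzureQueueConfig" type="NServiceBus.Config.AzureQueueConfig, NServiceBus.Azure" />
</configSections>
<!-- 10s invisibility x batch of 2 x 5 retries = ~100s before second-level retries -->
<AzureQueueConfig MessageInvisibleTime="10000" BatchSize="2" />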
Can you open an issue for this on GitHub, with a repro if possible? http://www.github.com/nservicebus/nservicebus
I suspect the delay comes from the Azure timeout persister, as that is the one responsible for managing the time between retries. Yet 20 minutes seems like a really odd number, so I have no immediate explanation for the observed behavior.
In the meantime, can you try using the in-memory timeout persister and see if the issue disappears? That would confirm my hypothesis.
I was under the impression that first-level retries did not need a timeout persister (I was not even aware of its existence, to be honest) and that first-level retries were driven only by the peek lock/invisible time of messages in the Azure queue.
For second-level retries I would expect the timeout persister to play a role (now that I know it exists...).
Yves, correct me if I am wrong.
