When I read the documentation about visibilityTimeout (https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-queue#host-json), it says: "The time interval between retries when processing of a message fails." The way I understand this is that if the timeout is set to 30 seconds and my function runs for 1 minute without failing in that period, the message does not become visible to others in the queue. But when I read up on it from other sources (Stack Overflow, for example), they tell me the opposite: when the execution time of the function exceeds the timeout, the message becomes visible EVEN though the function is still processing it.
What is the truth? Is the timeout only relevant once the function is no longer running (and may have failed), or can the message become visible again even while the function is still running?
What doesn't make sense either, if we assume the message becomes visible when the timeout is reached, is that the default timeout is 00:00:00, which would imply the message is visible at the same moment it is dequeued. This contradicts what the third-party sources are saying.
I am a bit confused by this.
It appears there are actually two different visibility timeout values used here. Both are set by the Azure WebJobs SDK but only one is configurable.
When the function fails
The queues.visibilityTimeout configuration option would be more aptly named retryDelay.
When the function throws an exception or fails with some other kind of error, the message is returned to the queue to be retried. The message is returned with the configured visibilityTimeout (see here), which delays when the function will next attempt to run.
This allows your application to cope with transient errors, for example when an email API or other external service is temporarily down. By delaying the retry, there is a chance that the service will be back online for the next attempt.
Retries are limited to maxDequeueCount attempts (5 by default) before a message is moved to the poison queue.
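For illustration, a function along these lines needs no retry code of its own (a minimal sketch; the queue name and EmailClient are hypothetical, not from the original post). Simply letting the exception escape triggers the delayed retry:
[FunctionName("SendEmail")]
public static async Task Run(
    [QueueTrigger("email-queue")] string message,
    ILogger log)
{
    // If the email service is down, this throws. The runtime then returns
    // the message to the queue with the configured visibilityTimeout and
    // retries, up to maxDequeueCount times, before poisoning it.
    await EmailClient.SendAsync(message); // hypothetical client
}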
While the function is running
When the QueueTrigger binding runs the function, it dequeues the message with a visibility timeout of 10 minutes (hard-coded here). It then sets a timer that extends the visibility window each time it reaches half-time, for as long as the function is running (see the timer and visibility update in the source).
Ordinarily you don't need to worry about this as long as your functions use CancellationTokens correctly (see the sketch below). This 10-minute timeout only matters if the Azure Functions/WebJobs host doesn't get to shut down gracefully. For example:
someone "pulls the plug" on the web host
if the function doesn't respond to the CancellationToken in time during scale-in or other Azure shutdown events
So, as long as the function is still running, the message will remain hidden from the queue.
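To make that concrete, here is a minimal sketch of a queue-triggered function that cooperates with host shutdown (the queue name and work loop are invented for the example):
[FunctionName("CooperativeJob")]
public static async Task Run(
    [QueueTrigger("work-queue")] CloudQueueMessage message,
    ILogger log,
    CancellationToken cancellationToken) // signalled when the host shuts down
{
    for (var i = 0; i < 20; i++)
    {
        // Task.Delay throws OperationCanceledException when the token fires,
        // so the function stops promptly and the message is released for a
        // retry instead of staying invisible for up to 10 minutes.
        await Task.Delay(TimeSpan.FromSeconds(30), cancellationToken);
        log.LogInformation($"Still working, iteration {i}");
    }
}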
Verification
I did a similar experiment to check:
[FunctionName("SlowJob")]
public async Task Run(
[QueueTrigger("slow-job-queue")] CloudQueueMessage message,
ILogger log)
{
for (var i = 0; i < 20; i++)
{
log.LogInformation($"Next visible {i}: {message.NextVisibleTime}");
await Task.Delay(60000);
}
}
Output:
Next visible 0: 5/11/2020 7:49:24 +00:00
Next visible 1: 5/11/2020 7:49:24 +00:00
Next visible 2: 5/11/2020 7:49:24 +00:00
Next visible 3: 5/11/2020 7:49:24 +00:00
Next visible 4: 5/11/2020 7:49:24 +00:00
Next visible 5: 5/11/2020 7:54:24 +00:00
Next visible 6: 5/11/2020 7:54:24 +00:00
Next visible 7: 5/11/2020 7:54:24 +00:00
Next visible 8: 5/11/2020 7:54:24 +00:00
Next visible 9: 5/11/2020 7:54:24 +00:00
Next visible 10: 5/11/2020 7:59:24 +00:00
Next visible 11: 5/11/2020 7:59:24 +00:00
...
I have tested this with
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;
using Microsoft.Extensions.Logging;
using Microsoft.WindowsAzure.Storage.Queue;

namespace WorkerFunctions
{
    public static class WorkerFunctions
    {
        [FunctionName("WorkerFunction1")]
        public static async Task Function1(
            [QueueTrigger("outputQueue")] CloudQueueMessage item,
            [Queue("outputQueue")] CloudQueue outputQueue,
            DateTimeOffset nextVisibleTime,
            DateTimeOffset expirationTime,
            DateTimeOffset insertionTime,
            ILogger log)
        {
            log.LogInformation("########## Function 1 ###############");
            log.LogInformation($"NextVisibleTime: {nextVisibleTime}");
            log.LogInformation($"NextVisibleTime: {(nextVisibleTime - insertionTime).TotalSeconds}");
            log.LogInformation($"C# Queue trigger function processed: {item.AsString}");
            Thread.Sleep(TimeSpan.FromMinutes(20));
        }

        [FunctionName("WorkerFunction2")]
        public static async Task Function2(
            [QueueTrigger("outputQueue")] CloudQueueMessage item,
            [Queue("outputQueue")] CloudQueue outputQueue,
            DateTimeOffset nextVisibleTime,
            DateTimeOffset expirationTime,
            DateTimeOffset insertionTime,
            ILogger log)
        {
            log.LogInformation("########## Function 2 ###############");
            log.LogInformation($"NextVisibleTime: {nextVisibleTime}");
            log.LogInformation($"NextVisibleTime: {(nextVisibleTime - insertionTime).TotalSeconds}");
            log.LogInformation($"C# Queue trigger function processed: {item.AsString}");
            Thread.Sleep(TimeSpan.FromMinutes(20));
        }
    }
}
With this host.json file:
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "maxPollingInterval": "00:00:02",
      "visibilityTimeout": "00:00:10",
      "batchSize": 16,
      "maxDequeueCount": 5,
      "newBatchThreshold": 8
    }
  }
}
And when I put a simple message on the queue and let it run, I see the following:
the function that grabs it doesn't release it before the sleep is over
I can't see in the logs that the lease is renewed, but it seems to happen under the hood
What this tells me:
if neither the function nor the host fails, the lease is auto-renewed, in line with https://stackoverflow.com/a/31883806/21199
when the visibility timeout is reached while the function is still running, the message doesn't get re-added to the queue
the documentation about visibilityTimeout is accurate: "The time interval between retries when processing of a message fails." (from https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-queue#hostjson-settings)
I haven't saved links to the third-party sources that contradicted this (sorry), but they exist. I hope someone will answer this, so I can get clarification.
Related
I have an Azure App Service that reads from an Azure Storage queue through a Camel route (version 3.14.0). Below is my code:
Queue code:
QueueServiceClient client = new QueueServiceClientBuilder()
        .connectionString(storageAccountConnectionString)
        .buildClient();
getContext().getRegistry().bind("client", client);

errorHandler(deadLetterChannel(SEND_TO_POISON_QUEUE)
        .useOriginalBody()
        .log("Message sent to poison queue for handling")
        .retryWhile(method(new RetryRuleset(), "shouldRetry"))
        .maximumRedeliveries(24)
        .asyncDelayedRedelivery()
        .redeliveryDelay(3600 * 1000L) // initial delay
);

// Route to retrieve a message from the storage queue.
from("azure-storage-queue:" + storageAccountName + "/" + QUEUE_NAME + "?serviceClient=#client&maxMessages=1&visibilityTimeout=P2D")
        .id(QUEUE_ROUTE_CONSUMER)
        .log("Message received from queue with messageId: ${headers.CamelAzureStorageQueueMessageId} and ${headers.CamelAzureStorageQueueInsertionTime} in UTC")
        .bean(cliFacilityService, "processMessage(${body}, ${headers.CamelAzureStorageQueueInsertionTime})")
        .end();
RetryRuleset code:
public boolean shouldRetry(@Header(QueueConstants.MESSAGE_ID) String messageId,
                           @Header(Exchange.REDELIVERY_COUNTER) Integer counter,
                           @Header(QueueConstants.INSERTION_TIME) OffsetDateTime insertionTime) {
    OffsetDateTime futureRetryOffsetDateTime = OffsetDateTime.now(Clock.systemUTC()).plusHours(1); // because redelivery delay is 1hr
    OffsetDateTime insertionTimePlus24hrs = insertionTime.plusHours(24);
    if (futureRetryOffsetDateTime.isAfter(insertionTimePlus24hrs)) {
        log.info("Facility queue message: {} done retrying because next time to retry {}. Redelivery count: {}, enqueue time: {}",
                messageId, futureRetryOffsetDateTime, counter, insertionTime);
        return false;
    }
    return true;
}
The redeliveryDelay is 1 hour and maximumRedeliveries is 24 because I want to retry roughly once an hour for about 24 hours. It doesn't necessarily need to be 24 attempts, just as many as fit within 24 hours; once 24 hours have passed, the message should be sent to the poison queue (this logic is in the RetryRuleset).
The problem: the app service retries normally, once an hour, for the first, say, 2 to 5 attempts, but after that the next retry happens 2 days later. By then the message has passed the 24-hour window, so the ruleset stops retrying and it is sent to the poison queue. Sometimes the app service does the first read from the queue and the next retry is already 2 days later. It is very unstable: in total it retries at most 1 to 10 times, and the last retry is always 2 days later, at the same time of day as the first.
Is there anything I am doing wrong?
Thank you for your help!
We are using a queue-triggered function app on a premium plan, where messages contain details such as Azure subscription names. For each subscription we make many API calls, especially to Azure storage accounts (around 400 to 500). Since the 'list' API call to a storage account is limited to 100 calls per 5 minutes, we get a 429 error response on the 101st call. To mitigate this we applied exponential retry logic (we tried both our own and the Polly library), which calls again after a certain delay. This works for some subscriptions but fails for many, where the retry logic does not retry after the first attempt (we configured 3 retries with a 60-second delay). While monitoring the function app through Live Metrics we also observed that the CPU usage of some function instances drops to zero (even though we do some work, such as logging or a for loop in the delay operation, to keep the function alive), which leads to that particular instance being killed, the message being pushed back to the queue, and the whole process starting again on a fresh instance.
Note that since many subscriptions are processed in parallel, the function app automatically scales out as required. Also, since we are using a premium plan, one VM is always on. So the killing of any instance (which makes around 400 to 500 storage API calls for a given subscription) is odd, since in our delay the thread sleep time is only 10 seconds per iteration, for around 6, 12, or 18 (Time_delay) iterations. The delay function below is used in our retry logic.
private void Delay(int Time_delay, string requestUri, int retryCount)
{
    for (int i = 0; i < Time_delay; i++)
    {
        _logger.LogWarning($"Sleep initiated for id: {requestUri}, RetryCount: {retryCount} CurrentTimeDelay: {Time_delay}");
        Thread.Sleep(10000);
        _logger.LogWarning($"Sleep completed for id: {requestUri}, RetryCount: {retryCount} CurrentTimeDelay: {Time_delay}");
    }
}
Note: the function app is not throwing any exception other than the dependency's 429 error response.
Would it be possible for you to requeue instead of using Thread.Sleep? You can use an initial visibility delay when requeuing:
public class Function1
{
    [FunctionName(nameof(TryDoWork))]
    public static async Task TryDoWork(
        [QueueTrigger("some-queue")] SomeItem item,
        [Queue("some-queue")] CloudQueue queue)
    {
        var result = _SomeService.SomeWork(item);
        if (result == 429)
        {
            // Requeue a copy of the message, hidden for a delay that grows
            // with the retry count, instead of blocking the thread.
            item.Retries++;
            var json = JsonConvert.SerializeObject(item);
            var message = new CloudQueueMessage(json);
            var delay = TimeSpan.FromSeconds(item.Retries);
            await queue.AddMessageAsync(message, null, delay, null, null);
        }
    }
}
It might be that the sleeping is causing some wonky function app behavior. I think I remember reading about issues pertaining to the usage of Thread.Sleep in functions, but I can't find the reference right now.
Also, you might want to add some sort of handling for messages that end up retrying more than 3 times (or however many you think is reasonable), as sketched below.
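For example, building on the snippet above (the poison queue binding and the MaxRetries constant are additions for illustration, not part of the original code):
[FunctionName(nameof(TryDoWork))]
public static async Task TryDoWork(
    [QueueTrigger("some-queue")] SomeItem item,
    [Queue("some-queue")] CloudQueue queue,
    [Queue("some-queue-poison")] CloudQueue poisonQueue) // assumed poison queue
{
    const int MaxRetries = 3; // whatever is reasonable for your workload

    var result = _SomeService.SomeWork(item);
    if (result == 429)
    {
        item.Retries++;
        var json = JsonConvert.SerializeObject(item);
        var message = new CloudQueueMessage(json);

        if (item.Retries > MaxRetries)
        {
            // Too many attempts; park the message for manual inspection.
            await poisonQueue.AddMessageAsync(message);
            return;
        }

        var delay = TimeSpan.FromSeconds(item.Retries);
        await queue.AddMessageAsync(message, null, delay, null, null);
    }
}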
I came across the link below and have questions:
https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-checkpointing-and-replay
1. When an OrchestrationTrigger durable function is invoked and crashes for some reason (e.g. after the max timeout duration of 10 minutes), will the inputs (names below) be read from table storage or the queue automatically?
[FunctionName("E1_HelloSequence")]
public static async Task<List<string>> Run(
[OrchestrationTrigger] DurableOrchestrationContext context)
{
var names= ctx.GetInput<List<string>>();
var outputs = new List<string>();
outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", names[0]));
outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", names[1]));
// returns ["Hello Tokyo!", "Hello Seattle!"]
return outputs;
}
2. After it crashes, will it restart automatically?
3. At each await, the function transitions into a wait state; does the wait period count toward the max timeout duration?
Hi, since Chris from the Functions product group is already involved with you on the GitHub thread, I'm posting the answers here as well so that they are beneficial for other members.
1) Yes, the results of any executed activity function will be read from table storage.
2) Yes, the function will retry automatically. An existing queue message ensures this.
3) No, time spent awaiting does not count against your max function timeout, nor are you billed for time spent awaiting.
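For context on point 1: the E1_SayHello activity referenced by the orchestrator above is an ordinary activity function whose result gets checkpointed. A minimal sketch, in line with the documentation's sample, would be:
[FunctionName("E1_SayHello")]
public static string SayHello([ActivityTrigger] string name)
{
    // The return value is persisted to table storage, so a replayed
    // orchestrator reads it from history instead of re-running this code.
    return $"Hello {name}!";
}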
I have implemented a Service Bus trigger; my sample code is below.
public static async Task Run(
    [ServiceBusTrigger("myqueue", Connection = "myservicebus:cs")] BrokeredMessage myQueueItem,
    TraceWriter log)
{
    try
    {
        if (myQueueItem.LockedUntilUtc <= DateTime.UtcNow)
        {
            log.Info($"Lock expired.");
            return;
        }

        // Renew the lock if it is about to expire.
        if ((myQueueItem.LockedUntilUtc - DateTime.UtcNow).TotalSeconds <= defaultLock)
        {
            await myQueueItem.RenewLockAsync();
            return;
        }

        // Process message logic
        await myQueueItem.CompleteAsync();
    }
    catch (MessageLockLostException ex)
    {
        log.Error($"Message lock lost. Exception: {ex}");
    }
    catch (CustomException ex)
    {
        // Forcefully dead-letter if a custom exception occurs.
        await myQueueItem.DeadLetterAsync();
    }
    catch (Exception ex)
    {
        log.Error($"Error occurred while processing message. Exception: {ex}");
        await myQueueItem.AbandonAsync();
    }
}
I have set the default lock duration on the queue as 5 minutes.
I'm getting a message lock lost exception for a few requests, even though the lock had not actually expired.
Request processing timings below:
Service bus trigger fired at: May 07 07:02:14 +00:00 2018 UTC
LockedUntilUtc in brokered message: May 07 07:07:08.0149905 UTC
Message lock lost exception occurred at: May 07 07:02:18 +00:00 2018 UTC
Can anybody help me find out what is actually wrong with my code?
Thanks.
You should not explicitly call CompleteAsync and AbandonAsync in your code. Those methods will be called by the Azure Functions runtime automatically based on the result of your function execution (Complete if no exception occurred, Abandon otherwise).
Renewing the lock manually shouldn't be necessary either; the runtime should manage that for you. If you are running on the Consumption plan, the maximum duration of a function execution is 5 minutes (by default) anyway.
Try removing all the plumbing code, leaving only the //Process message logic part, and see if that helps.
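In other words, a trimmed version of the function from the question might be as simple as the sketch below (ProcessMessage is a placeholder for your own logic, not an existing API):
public static async Task Run(
    [ServiceBusTrigger("myqueue", Connection = "myservicebus:cs")] BrokeredMessage myQueueItem,
    TraceWriter log)
{
    // Process message logic only. On success the runtime completes the
    // message; if an exception escapes, the runtime abandons it for retry.
    await ProcessMessage(myQueueItem, log); // placeholder for your logic
}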
I have a simple function that takes a message from a queue and saves it to a storage table. I expect that in some cases a table entity with the same data may already exist. Because of that, I added exception handling to skip this kind of situation and mark the queue message as processed. Despite the exception now being handled, the script host reports an error and the message is still in the queue.
I suppose this is caused by the fact that I'm using a table binding, which sits on the edge between the host and my code. Am I right? Should I use a table client within my code instead of the binding? Is there a different approach?
Sample code to reproduce the situation:
[FunctionName("MyFunction")]
public static async Task Run([QueueTrigger("myqueue", Connection = "Conn")]string msg, [Table("mytable", Connection = "Conn")] IAsyncCollector<DataEntity> dataEntity, TraceWriter log)
{
try
{
await dataEntity.AddAsync(new DataEntity()
{
PartitionKey = "1",
RowKey = "1",
Data = msg
});
await dataEntity.FlushAsync();
}
catch (StorageException e)
{
// when it is an exception that informs "entity already exists" skip it
}
}
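For reference, the table-client alternative I have in mind would look roughly like this (a sketch that binds CloudTable instead of the collector and skips only a 409 Conflict; the filter condition is my assumption about how to detect "entity already exists"):
[FunctionName("MyFunction")]
public static async Task Run(
    [QueueTrigger("myqueue", Connection = "Conn")] string msg,
    [Table("mytable", Connection = "Conn")] CloudTable table,
    TraceWriter log)
{
    try
    {
        await table.ExecuteAsync(TableOperation.Insert(new DataEntity
        {
            PartitionKey = "1",
            RowKey = "1",
            Data = msg
        }));
    }
    catch (StorageException e)
        when (e.RequestInformation?.HttpStatusCode == 409)
    {
        // The entity already exists; treat it as processed so the queue
        // message completes normally.
        log.Info($"Entity already exists, skipping message: {msg}");
    }
}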
When a queue trigger function fails, Azure Functions retries the function up to five times for a given queue message, including the first try.
If all five attempts fail, the functions runtime adds a message to a queue named <originalqueuename>-poison.
You can write a function to process messages from the poison queue by logging them or sending a notification that manual attention is needed.
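A minimal handler could be another queue-triggered function bound to the poison queue (a sketch; the names below assume the original queue is called myqueue, as in the question):
[FunctionName("ProcessPoisonMessage")]
public static void Run(
    [QueueTrigger("myqueue-poison", Connection = "Conn")] string poisonMessage,
    TraceWriter log)
{
    // Log the dead message so someone can investigate it manually.
    log.Error($"Poison queue message received: {poisonMessage}");
}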
The host.json file contains settings that control queue trigger behavior:
{
  "queues": {
    "maxPollingInterval": 2000,
    "visibilityTimeout": "00:00:30",
    "batchSize": 16,
    "maxDequeueCount": 1,
    "newBatchThreshold": 8
  }
}
Note: the default maxDequeueCount is 5; it is the number of times to try processing a message before moving it to the poison queue. For your need, you could set "maxDequeueCount": 1.
Also, these settings are host-wide and apply to all functions; you currently can't control them per function.