Durable Task Framework re-queue failed task - azure

How to use "waiting for external" event functionality of durable task framework in the code. Following is a sample code.
context.ScheduleWithRetry<LicenseActivityResponse>(
typeof(LicensesCreatorActivity),
_retryOptions,
input);
I am using ScheduleWithRetry<> method of context for scheduling my task on DTF but when there is an exception occurring in the code. The above method retries for the _retryOptions number of times.
After completing the retries, the Orchestration status will be marked as Failed.
I need a process by which i can resume my orchestration on DTF after correcting the reason of exception.
I am looking into the githib code for the concerned method in the code but no success.
I have concluded two solution:
Call a framework's method (if exist) and re-queue the orchestration from the state where it failed.
Hold the orchestration code in try catch and in catch section i implement a method CreateOrchestrationInstanceWithRaisedEventAsync whcih will put the orchestration in hold state until an external event triggers it back. Whenever a user (using some front end application) will call the external event for resuming (which means the user have made the corrections which were causing exception).
These are my understandings, if one of the above is possible then kindly guide me through some technical suggestions. otherwise find me a correct path for this task.

For the community's benefit, Salman resolved the issue by doing the following:
"I solved the problem by creating a sub orchestration in case of an exception occurs while performing an activity. The sub orchestration lock the event on azure as pending state and wait for an external event which raise the locked event so that the parent orchestration resumes the process on activity. This process helps if our orchestrations is about to fail on azure durable task framework"

I have figured out the solution for my problem by using "Signal Orchestrations" taken from code from GitHub repository.
Following is the solution diagram for the problem.
In this diagram, before the solution implemented, we only had "Process Activity" which actually executes the activity.
Azure Storage Table is for storing the multiplier values of an instanceId and ActivityName. Why we implemented this will get clear later.
Monitoring Website is the platform from where a user can re-queue/retry the orchestration activity to perform.
Now we have a pre-step and a post-step.
1. Get Retry Option (Pre-Step)
This method basically set the value of RetryOptions instance value.
private RetryOptions ModifyMaxRetires(OrchestrationContext context, string activityName)
{
var failedInstance =
_azureStorageFailedOrchestrationTasks.GetSingleEntity(context.OrchestrationInstance.InstanceId,
activityName);
var configuration = Container.GetInstance<IConfigurationManager>();
if (failedInstance.Result == null)
{
return new RetryOptions(TimeSpan.FromSeconds(configuration.OrderTaskFailureWaitInSeconds),
configuration.OrderTaskMaxRetries);
}
var multiplier = ((FailedOrchestrationEntity)failedInstance.Result).Multiplier;
return new RetryOptions(TimeSpan.FromSeconds(configuration.OrderTaskFailureWaitInSeconds),
configuration.OrderTaskMaxRetries * multiplier);
}
If we have any entry in our azure storage table against the instanceId and ActivityName, we takes the multiplier value from the table and updates the value of retry number in RetryOption instance creation. otherwise we are using the default number of retry value which is coming from our config.
Then:
We process the activity with scheduled retry number (if activity fails in any case).
2. Handle Exceptions (Post-Step)
This method basically handles the exception in case of the activity fails to complete even after the number of retry count set for the activity in RetryOption instance.
private async Task HandleExceptionForSignal(OrchestrationContext context, Exception exception, string activityName)
{
var failedInstance = _azureStorageFailedOrchestrationTasks.GetSingleEntity(context.OrchestrationInstance.InstanceId, activityName);
if (failedInstance.Result != null)
{
_azureStorageFailedOrchestrationTasks.UpdateSingleEntity(context.OrchestrationInstance.InstanceId, activityName, ((FailedOrchestrationEntity)failedInstance.Result).Multiplier + 1);
}
else
{
//const multiplier when first time exception occurs.
const int multiplier = 2;
_azureStorageFailedOrchestrationTasks.InsertActivity(new FailedOrchestrationEntity(context.OrchestrationInstance.InstanceId, activityName)
{
Multiplier = multiplier
});
}
var exceptionInput = new OrderExceptionContext
{
Exception = exception.ToString(),
Message = exception.Message
};
await context.CreateSubOrchestrationInstance<string>(typeof(ProcessFailedOrderOrchestration), $"{context.OrchestrationInstance.InstanceId}_{Guid.NewGuid()}", exceptionInput);
}
The above code first try to find the instanceID and ActivityName in azure storage. If it is not there then we simply add a new row in azure storage table for the InstanceId and ActivityName with the default multiplier value 2.
Later on we creates a new exception type instance for sending the exception message and details to sub-orchestration (which will be shown on monitoring website to a user). The sub-orchestration waits for the external event fired from a user against the InstanceId of the sub-orchestration.
Whenever it is fired from monitoring website, the sub-orchestration will end up and go back to start parent orchestration once again. But this time, when the Pre-Step activity will be called once again it will find the entry in azure storage table with a multiplier. Which means the retry options will get updated after multiplying it with default retry options.
So by this way, we can continue our orchestrations and prevent them from failing.
Following is the class of sub-orchestrations.
internal class ProcessFailedOrderOrchestration : TaskOrchestration<string, OrderExceptionContext>
{
private TaskCompletionSource<string> _resumeHandle;
public override async Task<string> RunTask(OrchestrationContext context, OrderExceptionContext input)
{
await WaitForSignal();
return "Completed";
}
private async Task<string> WaitForSignal()
{
_resumeHandle = new TaskCompletionSource<string>();
var data = await _resumeHandle.Task;
_resumeHandle = null;
return data;
}
public override void OnEvent(OrchestrationContext context, string name, string input)
{
_resumeHandle?.SetResult(input);
}
}

Related

Order Process with Azure Durable Functions or not

I am creating an architecture to process our orders from an ecommerce website who gets 10,000 orders or more every hour. We are using an external third party order fulfillment service and they have about 5 Steps/APIs that we have to run which are dependent upon each other.
I was thinking of using Fan in/Fan Out approach where we can use durable functions.
My plan
Once the order is created on our end, we store in a table with a flag of Order completed.
Run a time trigger azure function that runs the durable function orchestrator which calls the activity functions for each step
Now if it fails, timer will pick up the order again until it is completed. But my question is should we put this order in service bus and pick it up from there instead of time trigger.
Because there can be more than 10,000 records each hour so we have to run a query in the time trigger function and find orders that are not completed and run the durable orchestrator 10,000 times in a loop. My first question - Can I run the durable function parallelly for 10,000 records?
If I use service bus trigger to trigger durable orchestrator, it will automatically run azure function and durable 10,000 times parallelly right? But in this instance, I will have to build a dead letter queue function/process so if it fails, we are able to move it to active topic
Questions:
Is durable function correct approach or is there a better and easier approach?
If yes, Is time trigger better or Service bus trigger to start the orchestrator function?
Can I run the durable function orchestrator parallelly through time trigger azure function. I am not talking about calling activity functions because those cannot be run parallelly because we need output of one to be input of the next
This usecase fits function chaining. This can be done by
Have the ordering system put a message on a queue (storage or servicebus)
Create an azure function with storage queue trigger or service bus trigger. This would also be the client function that triggers the orchestration function
Create an orchestration function that invokes the 5 step APIs, one activity function for each (similar to as given in function chaining example.
Create five activity function, one f for each API
Ordering system
var clientOptions = new ServiceBusClientOptions
{
TransportType = ServiceBusTransportType.AmqpWebSockets
};
//TODO: Replace the "<NAMESPACE-NAME>" and "<QUEUE-NAME>" placeholders.
client = new ServiceBusClient(
"<NAMESPACE-NAME>.servicebus.windows.net",
new DefaultAzureCredential(),
clientOptions);
sender = client.CreateSender("<QUEUE-NAME>");
var message = new ServiceBusMessage($"{orderId}");
await sender.SendMessageAsync(message);
Client function
public static class OrderFulfilment
{
[Function("OrderFulfilment")]
public static string Run([ServiceBusTrigger("<QUEUE-NAME>", Connection = "ServiceBusConnection")] string orderId,
[DurableClient] IDurableOrchestrationClient starter)
{
var logger = context.GetLogger("OrderFulfilment");
logger.LogInformation(orderId);
return starter.StartNewAsync("ChainedApiCalls", orderId);
}
}
Orchestration function
[FunctionName("ChainedApiCalls")]
public static async Task<object> Run([OrchestrationTrigger] IDurableOrchestrationContext fulfillmentContext)
{
try
{
// .... get order with orderId
var a = await context.CallActivityAsync<object>("ApiCaller1", null);
var b = await context.CallActivityAsync<object>("ApiCaller2", a);
var c = await context.CallActivityAsync<object>("ApiCaller3", b);
var d = await context.CallActivityAsync<object>("ApiCaller4", c);
return await context.CallActivityAsync<object>("ApiCaller5", d);
}
catch (Exception)
{
// Error handling or compensation goes here.
}
}
Activity functions
[FunctionName("ApiCaller1")]
public static string ApiCaller1([ActivityTrigger] IDurableActivityContext fulfillmentApiContext)
{
string input = fulfillmentApiContext.GetInput<string>();
return $"API1 result";
}
[FunctionName("ApiCaller2")]
public static string ApiCaller2([ActivityTrigger] IDurableActivityContext fulfillmentApiContext)
{
string input = fulfillmentApiContext.GetInput<string>();
return $"API2 result";
}
// Repeat 3 more times...

Event Hub Consumer in Service Fabric

I'm trying to get a service fabric to consistently pull messages from an azure event hub. I seem to have everything wired up but notice that my consumer just stops pulling events.
I have a hub with a couple thousand events I've pushed to it. Configured the hub with 1 partition and have my service fabric service with also only 1 partition to ease debugging.
Service starts, creates the EventHubClient, from there uses it to create a PartitionReceiver. The receiver is passed to an "EventLoop" that enters an "infinite" while that calls receiver.ReceiveAsync. The code for the EventLoop is below.
What I am observing is the first time through the loop I almost always get 1 message. Second time through I get somewhere between 103 and 200ish messages. After that, I get no messages. Also seems like if I restart the service, I get the same messages again - but that's because when I restart the service I'm having it start back at the beginning of the stream.
Would expect this to keep running until my 2000 messages were consumed and then it would wait for me (polling ocassionally).
Is there something specific I need to do with the Azure.Messaging.EventHubs 5.3.0 package to make it keep pulling events?
//Here is how I am creating the EventHubClient:
var connectionString = "something secret";
var connectionStringBuilder = new EventHubsConnectionStringBuilder(connectionString)
{
EntityPath = "NameOfMyEventHub"
};
try
{
m_eventHubClient = EventHubClient.Create(connectionStringBuilder);
}
//Here is how I am getting the partition receiver
var receiver = m_eventHubClient.CreateReceiver("$Default", m_partitionId, EventPosition.FromStart());
//The event loop which the receiver is passed to
private async Task EventLoop(PartitionReceiver receiver)
{
m_started = true;
while (m_keepRunning)
{
var events = await receiver.ReceiveAsync(m_options.BatchSize, TimeSpan.FromSeconds(5));
if (events != null) //First 2/3 times events aren't null. After that, always null and I know there are more in the partition/
{
var eventsArray = events as EventData[] ?? events.ToArray();
m_state.NumProcessedSinceLastSave += eventsArray.Count();
foreach (var evt in eventsArray)
{
//Process the event
await m_options.Processor.ProcessMessageAsync(evt, null);
string lastOffset = evt.SystemProperties.Offset;
if (m_state.NumProcessedSinceLastSave >= m_options.BatchSize)
{
m_state.Offset = lastOffset;
m_state.NumProcessedSinceLastSave = 0;
await m_state.SaveAsync();
}
}
}
}
m_started = false;
}
**EDIT, a question was asked on the number of partitions. The event hub has a single partition and the SF service also has a single one.
Intending to use service fabric state to keep track of my offset into the hub, but that's not the concern for now.
Partition listeners are created for each partition. I get the partitions like this:
public async Task StartAsync()
{
// slice the pie according to distribution
// this partition can get one or more assigned Event Hub Partition ids
string[] eventHubPartitionIds = (await m_eventHubClient.GetRuntimeInformationAsync()).PartitionIds;
string[] resolvedEventHubPartitionIds = m_options.ResolveAssignedEventHubPartitions(eventHubPartitionIds);
foreach (var resolvedPartition in resolvedEventHubPartitionIds)
{
var partitionReceiver = new EventHubListenerPartitionReceiver(m_eventHubClient, resolvedPartition, m_options);
await partitionReceiver.StartAsync();
m_partitionReceivers.Add(partitionReceiver);
}
}
When the partitionListener.StartAsync is called, it actually creates the PartitionListener, like this (it's actually a bit more than this, but the branch taken is this one:
m_eventHubClient.CreateReceiver(m_options.EventHubConsumerGroupName, m_partitionId, EventPosition.FromStart());
Thanks for any tips.
Will
How many partition do you have? I can't see in your code how you make sure you read all partitions in the default consumer group.
Any specific reason why you are using PartitionReceiver instead of using an EventProcessorHost?
To me, SF seems like a perfect fit for using the event processor host. I see there is already a SF integrated solution that uses stateful services for checkpointing.

Queue triggered based function app not getting completed

We are using queue trigger based function app on premium plan where messages contains some details like azure subscriptions name. Based on which for each subscription we do many api calls specially to azure storage accounts(around 400 to 500). Since 'list' api call to storage account is limited to 100 call/5min, we get 429 response error on 101th call. To mitigate this we have applied exponential retry logic(tried both our own or polly library) which call after certain delay of time. This works for some subscription but fails for many where the retry logic does not try after first trying(we kept 3 retries with 60 sec delay). Even while monitoring the function app through live metrics we observed that sometimes cpu usage of some function instance goes to zero(although we do some operation like logging or use for loop in delay operation so that the function can be alive) which leads to killing of that particular function instance and pushing the message back to queue and start the process again with a fresh instance.
Note that since many subscription are processed in parallel, function app automatically scale up as required. Also since we are using premium plan one VM is always on state. So killing of any instance(which call around 400 to 500 storage api call for any particular subscription) is weird since in our delay the thread sleep time is only 10 sec for around 6,12,18(Time_delay) iteration. The below delay function is used in our retry logic code.
private void Delay(int Time_delay, string requestUri, int retryCount)
{
for (int i = 0; i < Time_delay; i++)
{
_logger.LogWarning($"Sleep initiated for id: {requestUri.ToString()}, RetryCount: {retryCount} CurrentTimeDelay: {Time_delay}");
Thread.Sleep(10000);
_logger.LogWarning($"Sleep completed for id: {requestUri.ToString()}, RetryCount: {retryCount} CurrentTimeDelay: {Time_delay}");
}
}
Note** Function app is not throwing any other exception other than dependency of 429 error response.
Would it be possible for you to requeue instead of using Thread.Sleep? You can use initial visibility delay when requeuing:
public class Function1
{
[FunctionName(nameof(TryDoWork))]
public static async Task TryDoWork(
[QueueTrigger("some-queue")] SomeItem item,
[Queue("some-queue")] CloudQueue queue)
{
var result = _SomeService.SomeWork(item);
if (result == 429)
{
item.Retries++;
var json = JsonConvert.SerializeObject(item);
var message = new CloudQueueMessage(json);
var delay = TimeSpan.FromSeconds(item.Retries);
await queue.AddMessageAsync(message, null, delay, null, null);
}
}
}
It might be that the sleeping is causing some wonky function app behavior. I think I remember reading some issues pertaining to the usage of Thread.Sleep, but I can't find it right now.
Also, you might want to add some sort of handling of messages that end up retrying more than 3 times (or however many you think is reasonable).

How to do Async in Azure WebJob function

I have an async method that gets api data from a server. When I run this code on my local machine, in a console app, it performs at high speed, pushing through a few hundred http calls in the async function per minute. When I put the same code to be triggered from an Azure WebJob queue message however, it seems to operate synchronously and my numbers crawl - I'm sure I am missing something simple in my approach - any assistance appreciated.
(1) .. WebJob function that listens for a message on queue and kicks off the api get process on message received:
public class Functions
{
// This function will get triggered/executed when a new message is written
// on an Azure Queue called queue.
public static async Task ProcessQueueMessage ([QueueTrigger("myqueue")] string message, TextWriter log)
{
var getAPIData = new GetData();
getAPIData.DoIt(message).Wait();
log.WriteLine("*** done: " + message);
}
}
(2) the class that outside azure works in async mode at speed...
class GetData
{
// wrapper that is called by the message function trigger
public async Task DoIt(string MessageFile)
{
await CallAPI(MessageFile);
}
public async Task<string> CallAPI(string MessageFile)
{
/// create a list of sample APIs to call...
var apiCallList = new List<string>();
apiCallList.Add("localhost/?q=1");
apiCallList.Add("localhost/?q=2");
apiCallList.Add("localhost/?q=3");
apiCallList.Add("localhost/?q=4");
apiCallList.Add("localhost/?q=5");
// setup httpclient
HttpClient client =
new HttpClient() { MaxResponseContentBufferSize = 10000000 };
var timeout = new TimeSpan(0, 5, 0); // 5 min timeout
client.Timeout = timeout;
// create a list of http api get Task...
IEnumerable<Task<string>> allResults = apiCallList.Select(str => ProcessURLPageAsync(str, client));
// wait for them all to complete, then move on...
await Task.WhenAll(allResults);
return allResults.ToString();
}
async Task<string> ProcessURLPageAsync(string APIAddressString, HttpClient client)
{
string page = "";
HttpResponseMessage resX;
try
{
// set the address to call
Uri URL = new Uri(APIAddressString);
// execute the call
resX = await client.GetAsync(URL);
page = await resX.Content.ReadAsStringAsync();
string rslt = page;
// do something with the api response data
}
catch (Exception ex)
{
// log error
}
return page;
}
}
First because your triggered function is async, you should use await rather than .Wait(). Wait will block the current thread.
public static async Task ProcessQueueMessage([QueueTrigger("myqueue")] string message, TextWriter log)
{
var getAPIData = new GetData();
await getAPIData.DoIt(message);
log.WriteLine("*** done: " + message);
}
Anyway you'll be able to find usefull information from the documentation
Parallel execution
If you have multiple functions listening on different queues, the SDK will call them in parallel when messages are received simultaneously.
The same is true when multiple messages are received for a single queue. By default, the SDK gets a batch of 16 queue messages at a time and executes the function that processes them in parallel. The batch size is configurable. When the number being processed gets down to half of the batch size, the SDK gets another batch and starts processing those messages. Therefore the maximum number of concurrent messages being processed per function is one and a half times the batch size. This limit applies separately to each function that has a QueueTrigger attribute.
Here is a sample code to configure the batch size:
var config = new JobHostConfiguration();
config.Queues.BatchSize = 50;
var host = new JobHost(config);
host.RunAndBlock();
However, it is not always a good option to have too many threads running at the same time and could lead to bad performance.
Another option is to scale out your webjob:
Multiple instances
if your web app runs on multiple instances, a continuous WebJob runs on each machine, and each machine will wait for triggers and attempt to run functions. The WebJobs SDK queue trigger automatically prevents a function from processing a queue message multiple times; functions do not have to be written to be idempotent. However, if you want to ensure that only one instance of a function runs even when there are multiple instances of the host web app, you can use the Singleton attribute.
Have a read of this Webjobs SDK documentation - the behaviour you should expect is that your process will run and process one message at a time, but will scale up if more instances are created (of your app service). If you had multiple queues, they will trigger in parallel.
In order to improve the performance, see the configurations settings section in the link I sent you, which refers to the number of messages that can be triggered in a batch.
If you want to process multiple messages in parallel though, and don't want to rely on instance scaling, then you need to use threading instead (async isn't about multi-threaded parallelism, but making more efficient use of the thread you're using). So your queue trigger function should read the message from the queue, the create a thread and "fire and forget" that thread, and then return from the trigger function. This will mark the message as processed, and allow the next message on the queue to be processed, even though in theory you're still processing the earlier one. Note you will need to include your own logic for error handling and ensuring that the data wont get lost if your thread throws an exception or can't process the message (eg. put it on a poison queue).
The other option is to not use the [queuetrigger] attribute, and use the Azure storage queues sdk API functions directly to connect and process the messages per your requirements.

Consumer/Producer with order and constraint on consumed items

I have the following scenario
I am writing a server that process files (jobs)
a file has a "prefix" and a time
the files should be processed according to time (older file first) but also take into account the prefix (files with same prefix can't be processed concurrently)
I have a thread (Task with Timer) that watches over a directory and adds files to a "queue" (producer)
I have several consumers that take the file from "queue" (consumer) - they should conform to the above rules.
the job of each task is kept in some list (this indicates the constraints)
There are several consumers, the number of consumers is determined at startup.
One of the requirement is to be able to gracefully stop the consumers (either immediately or let ongoing processes to finish).
I did something along this line:
while (processing)
{
//limits number of concurrent tasks
_processingSemaphore.Wait(queueCancellationToken);
//Take next job when available or wait for cancel signal
currentwork = workQueue.Take(taskCancellationToken);
//check that it can actually process this work
if (CanProcess(currnetWork)
{
var task = CreateTask(currentwork)
task.ContinueWith((t) => { //release processing slot });
}
else
//release slot, return job? something else?
}
The cancellation tokens sources are in the caller code and can be cancelled. There are two in order to be able to stop queuing while not cancelling running tasks.
I tired to implement the "queue" as BlockingCollection wrapping a "safe" SortedSet. The general idea work (ordering by time) except the case in which I need to find a new job that matches the constraint. If I return the job to the queue and try to take again I will get the same one.
It is possible to take jobs from the queue until I find a proper one and then returning the "illegal" jobs back but this may cause issues with other consumers processing out of order jobs
Another alternative is to pass a simple collection and a way to lock it and just lock and do a simple search according to current constraints. Again, this means writing code that will possibly not be thread-safe.
Any other suggestion / pointers / data structures that can help?
I think Hans is right: if you already have a thread-safe SortedSet (that implements IProducerConsumerCollection, so it can be used in BlockingCollection), then all you need is to put only files that can be processed right now into the collection. If you finish a file which makes another file available for processing, add the other file to the collection at this point, not earlier.
I would have implemented your requirement(s) with TPL Dataflow. Look at the way you could implement the Producer-Consumer pattern with it. I believe this will meet all the requirements you have (including cancellation on the consumers).
EDIT (for those that do not like to read documentation, but who does...)
Here is an example of how you could implement the requirements with TPL Dataflow. The beauty of this implementation is that consumers are not bound to a single thread and only uses a pool thread when it needs to process data.
static void Main(string[] args)
{
BufferBlock<string> source = new BufferBlock<string>();
var cancellation = new CancellationTokenSource();
LinkConsumer(source, "A", cancellation.Token);
LinkConsumer(source, "B", cancellation.Token);
LinkConsumer(source, "C", cancellation.Token);
// Link an action that will process source values that are not processed by other
source.LinkTo(new ActionBlock<string>((s) => Console.WriteLine("Default action")));
while (cancellation.IsCancellationRequested == false)
{
ConsoleKey key = Console.ReadKey(true).Key;
switch (key)
{
case ConsoleKey.Escape:
cancellation.Cancel();
break;
default:
Console.WriteLine("Posted value {0} on thread {1}.", key, Thread.CurrentThread.ManagedThreadId);
source.Post(key.ToString());
break;
}
}
source.Complete();
Console.WriteLine("Done.");
Console.ReadLine();
}
private static void LinkConsumer(ISourceBlock<string> source, string prefix, CancellationToken token)
{
// Link a consumer that will buffer and process all input of the specified prefix
var consumer = new ActionBlock<string>(new Action<string>(Process), new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = 1, SingleProducerConstrained = true, CancellationToken = token, TaskScheduler = TaskScheduler.Default });
var linkDisposable = source.LinkTo(consumer, (p) => p == prefix);
// Dispose the link (remove the link) when cancellation is requested.
token.Register(linkDisposable.Dispose);
}
private static void Process(string arg)
{
Console.WriteLine("Processed value {0} in thread {1}", arg, Thread.CurrentThread.ManagedThreadId);
// Simulate work
Thread.Sleep(500);
}

Resources