Durable Function triggered with delay - azure

I'm currently facing a weird issue.
Randomly (I guess) my Azure durable function invocation is triggered with a delay of more than 10 minutes.
My understanding is that something is wrong with the lease for the control queue.
I'm on the Consumption plan, so I'm wondering whether the scale-in/out mechanism works properly with my durable function. My feeling is that a host instance takes the lease, then goes into drain mode, gets recycled, and so on, and keeps the lease for 10 minutes before releasing it.
It seems to happen after a period of inactivity.
Have you ever seen such behavior?
Have you ever seen such behavior ?

I found a similar recent issue, #1148771, reported in January 2023 on the Microsoft Q&A forum for Azure Functions, where a user on the Consumption plan experienced delays in durable functions at orchestration start.
That case is still under investigation by the Microsoft support team, and some possible causes were mentioned, such as:
The lease for the control queue is held by a previous function instance that was supposed to be recycled.
As a basic check, make sure you have the latest versions of all the packages used in the function code.
The same Q&A thread also describes, based on the timestamps and orchestration IDs provided, the scenario that caused the issue. If your scenario is similar, you can follow that discussion for the solution.
If it is a product issue, you could raise a ticket in the azure-functions-durable-extension repository on GitHub.
Refer to GitHub issue #606 and the MS Docs troubleshooting steps for orchestration start delays in Azure Durable Functions.

Today, I received an update from the product team: the issue is caused by the current lease management logic used by Durable Functions. It can happen when an instance is shut down.
This issue is specific to the Azure Storage backend.
This issue has existed for a while. We're trying to refactor the current lease management logic in the hope of fixing it, but there is no clear ETA yet. You can subscribe to "Orchestrations freezing mid-execution" (Issue #2207 in Azure/azure-functions-durable-extension on GitHub) to be notified when the fix is ready.

Related

How to handle an Azure Function rerunning when using message queue binding?

I have a v1 Azure Function that is triggered by a message being written to the Azure Storage message queue.
The Azure Function needs to perform multiple updates to SharePoint Online. Occasionally these operations fail. This results in the message being returned to the queue and being reprocessed.
When I developed the function, I didn't consider that it might partially complete and then restart. I've done a little research and it sounds like I need to modify it to be re-entrant.
Is there a design pattern that I should follow to cater for this without having to add a lot of checks to determine whether an operation has already been carried out by a previous execution? Alternatively, is there any Azure functionality that can help (beyond the existing message retries and poison queue)?
It sounds like you will need to do some re-engineering. Our team had a similar issue and wrote a home-grown solution years ago. But we eventually scrapped our solution and went with Azure Durable Functions.
Not gonna lie - this framework has some complexity and it took me a bit to wrap my head around it. Check out the function chaining pattern.
We have processing that requires multiple steps that must all complete. We span multiple data stores (updating Cosmos DB, Azure SQL, Blob Storage, etc.), and there's no support for distributed transactions across multiple PaaS offerings. Durable Functions lets you break your process up into discrete steps; if a step fails, the orchestrator re-runs that step based on a retry policy.
So in a nutshell, we use Durable Task Activity functions to attempt each step. If the step fails due to what we think is a transient error, we retry. If it's an unrecoverable error, we don't retry.
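The split between transient and unrecoverable errors described above can be sketched in plain Python. This is an illustration of the pattern, not the Durable Functions retry API; the exception classes, `run_with_retry`, and the backoff values are all invented for the example:

```python
import time

class TransientError(Exception):
    """An error we expect to go away on retry (e.g. throttling, timeouts)."""

class UnrecoverableError(Exception):
    """An error that retrying will never fix (e.g. malformed input)."""

def run_with_retry(step, max_attempts=3, base_delay=0.01):
    """Retry a step on transient errors with exponential backoff;
    any other exception (including UnrecoverableError) propagates
    immediately without a retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts; give up
            time.sleep(base_delay * 2 ** (attempt - 1))
```

In Durable Functions proper, the orchestrator supplies this behavior for you via a retry policy when calling an activity, so your activity code only has to raise the right kind of error.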

Ensuring Azure Service Bus never loses a single message

I have a system where losing messages from Azure Service Bus would be a disaster, that is, the data would be lost forever with no practical means to repair the damage without major disruption.
Would I ever be able to rely on ASB entirely in this situation? (even if it was to go down for hours, it would always come back up with the same messages it was holding when it failed)
If not, what is a good approach for solving this problem where messages are crucial and cannot be lost? Is it common to simply persist the message to a database prior to publishing so that it can be referred to in the event of a disaster?
Azure Service Bus doesn’t lose messages. Once a message is successfully received by the broker, it’s there and doesn’t go anywhere. Where usually things go wrong is with message receiving and processing, which is user custom code. That’s one of the reasons why Service Bus has PeekLock receive mode and dead-lettering based on delivery count.
If you need strong guarantees, don’t reinvent the wheel yourself. Use messaging frameworks that do it for you, such as NServiceBus or MassTransit.
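To make the PeekLock-plus-dead-lettering mechanics concrete, here is a tiny in-memory model. This is not the Service Bus SDK; the class, method names, and the max-delivery-count value are invented purely to illustrate why a message that keeps failing ends up dead-lettered rather than lost:

```python
from collections import deque

class PeekLockQueue:
    """Toy model of peek-lock delivery: a received message is only gone
    for good when completed; abandoning it increments its delivery count,
    and after max_delivery_count failed deliveries it moves to the
    dead-letter queue instead of being deleted."""

    def __init__(self, max_delivery_count=10):
        self.max_delivery_count = max_delivery_count
        self.active = deque()   # (message, delivery_count) pairs
        self.dead_letter = []

    def send(self, message):
        self.active.append((message, 0))

    def receive(self):
        """Peek-lock the next message; the caller must complete or abandon it."""
        return self.active.popleft() if self.active else None

    def complete(self, locked):
        pass  # processing succeeded; the message is already off the queue

    def abandon(self, locked):
        message, count = locked
        count += 1
        if count >= self.max_delivery_count:
            self.dead_letter.append(message)   # parked, never silently lost
        else:
            self.active.append((message, count))  # redelivered later
```

The point of the model: user code that crashes mid-processing only ever *abandons* a locked message, so the broker either redelivers it or dead-letters it; there is no path where it vanishes.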
With Azure Service Bus you can do this and be 99.99% sure;
in the worst case you will find your message in the dead-letter queue, but it will never be deleted.
Another choice is to use an Azure Storage queue and set the TTL to -1, which gives the message an infinite lifetime.
But because I'm a little bit old school, and to be 101% sure, I would suggest a manual solution using Azure Table Storage,
so that it's you who decides when to add, update, or delete a row, given the criticality of the information and data you work with.

Document-centric event scheduling on Azure

I'm aware of the many different ways of scheduling system-centric events in Azure. E.g. Azure Scheduler, Logic Apps, etc. These can be used for things like backups, sending batch emails, or other maintenance functions.
However, I'm less clear on what technology is available for events relating to a large volume of documents or records.
For example, imagine I have 100,000 documents in Cosmos and some of the datetime properties on those documents relate to events: e.g. expiry, reminders, escalations, timeouts, etc. Each record has a different set of dates and times.
What approaches are there to fire off code whenever one of those datetimes is reached?
Stuff I've thought of so far:
Have a scheduled task that runs once per minute and looks for anything relating to that particular minute in Cosmos then does "stuff".
Schedule tasks on Service Bus queues with a future date as-and-when the Cosmos records are created and then have something to receive those messages and do "stuff".
But are there better ways of doing this? Is there a ready-made Azure service that would take away much of the background infrastructure work and just let me schedule a single one-off event at a particular point in time and hit a webhook or something like that?
Am I mis-categorising Azure Scheduler as something that you'd use for a handful of regularly scheduled tasks rather than the mixed bag of dates and times you'd find in 100,000 Cosmos records?
FWIW, in my use-case there isn't really a precision issue - stuff scheduled for 10:05:00 happening at 10:05:32 would be perfectly acceptable, for example.
Appreciate your thoughts.
First of all, Azure Scheduler will be replaced by Azure Logic Apps:
Azure Logic Apps is replacing Azure Scheduler, which is being retired. To schedule jobs, follow this article for moving to Azure Logic Apps instead.
(source)
That said, Azure Logic Apps is one of your options, since you can define a logic app that starts a one-time job using a delay action. See the docs for details.
It scales very well and you can pay for what you use (or use a fixed pricing model).
Another option is a durable Azure Function with a timer in it. Once the timer elapses, you can do your thing. You can use a Consumption plan, so you pay only for what you use, or you can use a fixed pricing model. It also scales very well, so hundreds of those instances won't be a problem.
In both cases you have to trigger the function or logic app when the Cosmos records are created. Put the due time as context in the trigger and there you go.
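The overall shape of that idea, registering a one-off event at document-creation time and firing a callback with the document's context when its due time arrives, can be sketched with a plain in-process scheduler. Everything here (`Scheduler`, the callback signature) is invented for illustration and is not an Azure API:

```python
import heapq

class Scheduler:
    """Toy one-shot scheduler: register (due_time, context) pairs and fire
    a callback for everything that is due when run_until() is called,
    e.g. from a once-per-minute timer trigger."""

    def __init__(self, callback):
        self.callback = callback
        self.heap = []       # min-heap ordered by due time
        self.counter = 0     # tie-breaker so the heap never compares contexts

    def schedule(self, due_time, context):
        heapq.heappush(self.heap, (due_time, self.counter, context))
        self.counter += 1

    def run_until(self, now):
        """Fire every event whose due time has passed; minute-level
        precision is fine per the question's requirements."""
        fired = []
        while self.heap and self.heap[0][0] <= now:
            _, _, context = heapq.heappop(self.heap)
            self.callback(context)
            fired.append(context)
        return fired
```

In the Azure versions, the durable function timer or the Logic Apps delay action plays the role of the heap, and the trigger payload plays the role of `context`.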
Now, given your statement
I'm aware of the many different ways of scheduling system-centric events in Azure. E.g. Azure Scheduler, Logic Apps, etc. These can be used for things like backups, sending batch emails, or other maintenance functions.
That is up to you. You can do anything you want. You don't specify in your question what work needs to be done when the due time is reached but I doubt it is something you can't do with those services.

How are OS configuration changes controlled when using Service Fabric?

When using Azure web/worker roles, users can specify osVersion to explicitly set the "Guest OS image" version. This ensures that when Microsoft issues new critical updates, they first show up in a newer "OS image", which users can explicitly specify in order to test their service against it.
How is the same achieved with Azure Service Fabric? Suppose I deployed my service into Azure Service Fabric and it's been running for a month, then Microsoft issues updates for the OS on the server where the service is running - how are they applied such that I can test them first to ensure they don't break the service?
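For reference, the web/worker role mechanism the question refers to is the osFamily/osVersion pair in the cloud service configuration file. A minimal sketch (the service name, role name, and values are placeholders):

```xml
<!-- ServiceConfiguration.cscfg: pin the Guest OS image explicitly,
     or use "*" to accept automatic Guest OS updates -->
<ServiceConfiguration serviceName="MyService"
                      xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration"
                      osFamily="4"
                      osVersion="*">
  <Role name="WebRole1">
    <Instances count="2" />
  </Role>
</ServiceConfiguration>
```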
Brett is correct. An SF cluster is based on Azure VMSS, and the expectation is that the customer is responsible for patching the OS. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-upgrade/
We have heard from a majority of SF customers that this is not at all desirable and that they do not want to be responsible for OS patching.
A feature to enable opt-in automatic OS patching is indeed a very high priority within the Azure Compute team. The exact details of how best to offer this are still in design, but the intent is to have this functionality enabled before the end of the year.
Although that is the right long-term solution, to mitigate this issue in the short term the SF team is working on a set of steps that will let customers opt into having their VMs patched using Windows Update in a safe manner. Once the steps are tested, we will blog about it and publish a document detailing them. Expect that in the next couple of months.
As I understand it you are currently responsible for managing patching on SF cluster nodes yourself. Apparently moving this to be a SF managed feature is planned but I have no idea how far down the road it might be.
I personally would make this a high priority. Having used Cloud Services for many years I have come to rely on never having to patch my VM's manually. SF is a large backwards step in this particular area.
It'd be great to hear from an Azure PM on this...
Automatic Image based patching like cloud services in service fabric.
Today you do not have that option. The image based patching capability is work in progress. I posted a road map to get there on the team blog : https://blogs.msdn.microsoft.com/azureservicefabric/2017/01/09/os-patching-for-vms-running-service-fabric/ Try out the script and report any issues you hit. Looking forward to your feedback.
Lots of parts of Service Fabric are a huge step backwards. Whole new hosts of problems have been introduced that the IIS/WAS/WCF team had already solved and that now need to be developed all over again. The concept of releasing a PaaS platform while still requiring OS patch management is laughable. To add insult to injury, there is no migration path from "classic cloud PaaS" to this stuff. WEEEE, I get to write my very own service host, something that was provided out of the box for a decade by WAS. Not all of us were scared by the ability to control all aspects of service host communication via configuration. Now we get to use code, so a tweak to channel configuration requires a full patch/release cycle!

Recurring Timeout on Sql-Azure

On our system, which is implemented by a web role that uses a database sql-azure, we are experiencing recurring timeout on a specific query.
These timeouts occur for a few hours during the day and then do not show up anymore.
The query joins two tables on their primary keys; the number of rows is not very high (about 800,000).
The execution plan is ok, the indexes are used properly, the query normally takes two seconds to be performed.
Tests without EntityFramework give the same result.
Transient fault handling is not applicable in the case of a timeout.
What can be the cause of this behavior?
We have experienced similar issues in the past using SQL Azure: queries running against tables with fewer than 10 rows, and even the standard .NET membership provider queries, all failed intermittently with timeouts. This usually happens when we have little to no activity on our service, mostly at night.
In commonly used areas where it is safe to retry on a SQL timeout (usually read operations), we have added the timeout exception to our custom error-detection strategy, taken from the Transient Fault Handling Block; however, as you stated, this is not appropriate in most cases.
The best explanation we have received from Azure support thus far is that SQL Azure is really a shared SQL Server instance used by multiple clients, so if one user performs an intensive operation it can affect other users in this way. However, believing this not to be acceptable, we are still in contact with SQL Azure support to ascertain why throttling is not stopping this sort of activity from affecting us.
Your best bet is to:
Contact SQL Azure support, either through the forums or directly (if you have a support package)
If possible, set up a new SQL Azure instance and migrate your database across
Whilst we get this issue intermittently on one SQL Azure instance, we have never experienced it on our other two instances.
As a side note, we are still waiting on Azure support to get back to us regarding why we were still receiving timeout exceptions.
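The custom error-detection idea mentioned above (retry only when the failure is a detected timeout, and only for read operations) can be sketched like this. The exception type and the wrapper are invented for illustration; this is not the Transient Fault Handling Block API:

```python
import time

class SqlTimeoutError(Exception):
    """Stand-in for a provider-specific query timeout exception."""

def is_transient(exc):
    """Custom error-detection strategy: only timeouts are considered
    transient and therefore retryable."""
    return isinstance(exc, SqlTimeoutError)

def execute_read(query_fn, max_attempts=3, delay=0.01):
    """Run a read-only query, retrying on detected transient errors.
    Writes should NOT go through this wrapper: a timed-out write may
    actually have committed, so blindly retrying it is not safe."""
    for attempt in range(1, max_attempts + 1):
        try:
            return query_fn()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts:
                raise  # non-transient, or out of attempts
            time.sleep(delay)
```

Keeping the detection strategy in one function makes it easy to widen it later (e.g. to throttling error codes) without touching the call sites.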
