How can we reprocess the Cosmos lease document using an Azure Function?

I am new to Azure Functions. We recently tried a CosmosDBTrigger Function, which needs to create the lease document, and we noticed that whenever something changes in the Cosmos container, a new entry is added to the lease document. However, we don't understand what these items mean or how we could use them in other scenarios instead of just logging them. In addition, we sometimes get an exception in the CosmosDBTrigger Function; when that happens our function just stops, and we lose all the changed documents from that invocation. So we're wondering whether there is any way to recapture the changed items from the last triggered event by using the lease document, but we're not sure what the lease document can tell us. Could someone explain whether that is approachable?

From the official documentation at https://learn.microsoft.com/azure/cosmos-db/change-feed-functions:
The lease container: The lease container maintains state across multiple and dynamic serverless Azure Function instances and enables dynamic scaling. This lease container can be manually or automatically created by the Azure Functions trigger for Cosmos DB. To automatically create the lease container, set the CreateLeaseCollectionIfNotExists flag in the configuration. Partitioned lease containers are required to have a /id partition key definition.
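To make that concrete, here is a minimal trigger sketch (the database, container, lease, and connection-setting names are placeholders, not from your setup). The entries you see appearing in the lease container are the trigger's own bookkeeping: per-partition ownership and continuation checkpoints, not data intended for your code to consume.

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ChangeFeedListener
{
    [FunctionName("ChangeFeedListener")]
    public static void Run(
        [CosmosDBTrigger(
            databaseName: "mydb",
            collectionName: "items",
            ConnectionStringSetting = "CosmosConnection",
            LeaseCollectionName = "leases",
            CreateLeaseCollectionIfNotExists = true)] IReadOnlyList<Document> changes,
        ILogger log)
    {
        // Each item is the current version of a document that changed.
        // The lease container itself holds per-partition ownership and
        // continuation checkpoints; it is trigger bookkeeping, not your data.
        log.LogInformation("Processing {Count} changed documents", changes.Count);
    }
}
```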
On to your second question, error handling. The reference document is: https://learn.microsoft.com/azure/cosmos-db/troubleshoot-changefeed-functions
The Azure Functions trigger for Cosmos DB, by default, won't retry a batch of changes if there was an unhandled exception during your code execution.
If your code throws an unhandled exception, the current batch of changes that was being processed is lost because the Function will exit and record an Error, and continue with the next batch.
In this scenario, the best course of action is to add try/catch blocks in your code and inside the loops that might be processing the changes, to detect any failure for a particular subset of items and handle them accordingly (send them to another storage for further analysis or retry).
So, make sure you have try/catch blocks in your foreach/for statements, detect any Exception, deadletter that failed document, and continue with the next in the batch.
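A hedged sketch of that pattern, assuming placeholder names for the queue and for the HandleAsync business logic:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ChangeFeedProcessor
{
    [FunctionName("ChangeFeedProcessor")]
    public static async Task Run(
        [CosmosDBTrigger("mydb", "items",
            ConnectionStringSetting = "CosmosConnection",
            LeaseCollectionName = "leases",
            CreateLeaseCollectionIfNotExists = true)] IReadOnlyList<Document> changes,
        [Queue("cosmos-deadletter")] IAsyncCollector<string> deadletter,
        ILogger log)
    {
        foreach (Document doc in changes)
        {
            try
            {
                await HandleAsync(doc); // your business logic (placeholder)
            }
            catch (Exception ex)
            {
                // Document.ToString() serializes the document as JSON, so the
                // full payload lands in the dead-letter queue for later retry.
                log.LogError(ex, "Dead-lettering document {Id}", doc.Id);
                await deadletter.AddAsync(doc.ToString());
            }
        }
    }

    private static Task HandleAsync(Document doc) => Task.CompletedTask; // stub
}
```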
This approach is common to all event-based Function triggers, like Event Hub. For reference: https://hackernoon.com/reliable-event-processing-in-azure-functions-37054dc2d0fc
If you want to reset a Cosmos DB Trigger to go back and replay the documents from the start, after already having the Trigger working for some time, you need to:
Stop your Azure function if it is currently running.
Delete the documents in the lease collection (or delete and re-create the lease collection so it is empty)
Set the StartFromBeginning property of the CosmosDBTrigger attribute in your function to true.
Restart the Azure function. It will now read and process all changes from the beginning.
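In code, step 3 is a single property on the trigger attribute. Only the parameter declaration is sketched below (placeholder names again); the rest of the function stays as in the sketches above. Note that StartFromBeginning only takes effect when the lease collection is empty, which is why step 2 is required.

```csharp
[CosmosDBTrigger("mydb", "items",
    ConnectionStringSetting = "CosmosConnection",
    LeaseCollectionName = "leases",
    CreateLeaseCollectionIfNotExists = true,
    StartFromBeginning = true)] // replay the change feed from the start
IReadOnlyList<Document> changes
```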

Related

Retriggering an Azure Cosmos DB Trigger

I've run into a problem with my Azure Cosmos DB trigger. Apparently some of the triggers failed and thus didn't complete sending the data to a specific service. As far as I can see, there is no easy way to 'retrigger' those events without actually inserting the data into Cosmos again.
I read somewhere that I could insert the incoming data from the trigger into a Service Bus queue message and handle it from there. Then I could use the dead-letter queue to requeue failed items if necessary. However, the messages contain a couple of kilobytes of data, and I'm not sure whether that is wise.
What would be the best way to tackle this issue?
Thanks!
You can only retrigger by:
Modifying the document (a replace operation)
Manually calling the trigger via the API and passing the document's content
Putting the message into a separate queue, as you mentioned
Using retries on the Cosmos DB trigger for short-lived transient issues
We have been using the ServiceBus solution for quite a while now without any issues. The maximum message size is 256 KB for the standard tier, which is plenty.
If the size is really an issue for you, you could put only the documentId into ServiceBus. However, this creates a solution that is more read-intensive for your CosmosDB, and if you want to avoid that, the solution gets even more complex.
This is already quite opinionated, but the ServiceBus solution is in my experience very robust and not very complex. If you only need this very rarely, you can use the manual approach to "fake" re-triggering of the event.
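If you do go the id-only route, the consumer side could look roughly like this (the queue, database, and container names are assumptions, and the sketch assumes the document id doubles as the partition key):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.WebJobs;

public static class RetryFromServiceBus
{
    private static readonly CosmosClient Client =
        new CosmosClient(Environment.GetEnvironmentVariable("CosmosConnection"));

    [FunctionName("RetryFromServiceBus")]
    public static async Task Run(
        [ServiceBusTrigger("retry-items", Connection = "ServiceBusConnection")]
        string documentId)
    {
        Container container = Client.GetContainer("mydb", "items");

        // Re-read the full document by id; this is the extra read the
        // answer warns about. Assumes /id is (or mirrors) the partition key.
        ItemResponse<dynamic> response =
            await container.ReadItemAsync<dynamic>(documentId, new PartitionKey(documentId));

        await HandleAsync(response.Resource); // your business logic (placeholder)
    }

    private static Task HandleAsync(dynamic doc) => Task.CompletedTask; // stub
}
```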

Bulk delete in Azure CosmosDB leads to 429 error

I've implemented bulk deletion as recommended with the newer SDK: I created a list of tasks to delete each item and then awaited them all, with my CosmosClient configured with AllowBulkExecution = true. As I understand it, the new SDK then does its magic under the hood and performs a bulk operation.
Unfortunately, I've encountered a 429 response status, meaning my requests hit the request rate limit (it is a low, development-only tier, but nonetheless). I wonder how a single bulk operation can cause a 429 error, and how to implement bulk deletion in a way that isn't "per item".
UPDATE: I use the Azure Cosmos DB .NET SDK v3 for the SQL API with bulk operations support, as described in this article: https://devblogs.microsoft.com/cosmosdb/introducing-bulk-support-in-the-net-sdk/
You need to handle 429s for deletes the same way you'd handle them for any operation: wrap the call in an exception block, trap the status code, read the retry-after value from the response headers, then sleep and retry after that amount of time.
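A minimal sketch with the .NET SDK v3, using placeholder names (note the SDK also retries 429s internally, up to CosmosClientOptions.MaxRetryAttemptsOnRateLimitedRequests, before surfacing the exception):

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class Deletes
{
    // Delete one item, sleeping for the server-suggested interval on 429s.
    public static async Task DeleteWithBackoffAsync(
        Container container, string id, string partitionKey)
    {
        while (true)
        {
            try
            {
                await container.DeleteItemAsync<object>(id, new PartitionKey(partitionKey));
                return;
            }
            catch (CosmosException ex) when (ex.StatusCode == (HttpStatusCode)429)
            {
                // RetryAfter surfaces the x-ms-retry-after-ms response header.
                await Task.Delay(ex.RetryAfter ?? TimeSpan.FromMilliseconds(500));
            }
        }
    }
}
```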
PS: if you're trying to delete all the data in the container, it can be more efficient to delete and then recreate the container.

Azure Function Event Hub Trigger reliability

I'm a bit confused regarding the EventHubTrigger for Azure functions.
I've got an IoT Hub, and am using its eventhub-compatible endpoint to trigger an Azure function that is going to process and store the received data.
However, if my function fails (= throws an exception), that message (or messages) being processed during that function call will get lost. I actually would expect the Azure function runtime to process the messages at a later time again. Specifically, I would expect this behavior because the EventHubTrigger is keeping checkpoints in the Function Apps storage account in order to keep track of where in the event stream it has to continue.
The documentation of the EventHubTrigger even states that
If all function executions succeed without errors, checkpoints are added to the associated storage account
But still, even when I deliberately throw exceptions in my function, the checkpoints will get updated and the messages will not get received again.
Is my understanding of the EventHubTriggers documentation wrong, or is the EventHubTriggers implementation (or its documentation) wrong?
This piece of documentation does seem confusing indeed. I guess they mean errors of the Function App host itself, not of your code: an exception inside a function execution doesn't stop the processing and checkpointing progress.
The fact is that Event Hubs are not designed for individual message retries. The processor works in batches, and it can either mark the whole batch as processed (i.e. create a checkpoint after it), or retry the whole batch (e.g. if the process crashed).
See this forum question and answer.
If you still need to re-process failed events from Event Hubs (and errors don't happen too often), you could implement such a mechanism yourself, e.g. (see the sketch after this list):
Add an output Queue binding to your Azure Function.
Add try-catch around processing code.
If exception is thrown, add the problematic event to the Queue.
Have another Function with Queue trigger to process those events.
Note that the downside of this is that you will lose the ordering guarantee provided by Event Hubs (since a queue message will be processed later than its neighbors).
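A hedged sketch of that mechanism (the hub and queue names are assumptions, and the older Microsoft.Azure.EventHubs EventData type is used):

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.EventHubs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class EventProcessor
{
    [FunctionName("EventProcessor")]
    public static async Task Run(
        [EventHubTrigger("my-hub", Connection = "EventHubConnection")] EventData[] events,
        [Queue("events-deadletter")] IAsyncCollector<string> deadletter,
        ILogger log)
    {
        foreach (EventData e in events)
        {
            string body = Encoding.UTF8.GetString(e.Body.Array, e.Body.Offset, e.Body.Count);
            try
            {
                await HandleAsync(body); // your business logic (placeholder)
            }
            catch (Exception ex)
            {
                // Swallowing the exception lets the checkpoint advance;
                // the failed event is preserved on the queue instead.
                log.LogError(ex, "Dead-lettering failed event");
                await deadletter.AddAsync(body);
            }
        }
    }

    private static Task HandleAsync(string body) => Task.CompletedTask; // stub
}
```

A second, queue-triggered Function would then pick up events-deadletter and perform the actual retry (step 4 above).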
Quick fix: since a retry policy won't help if the downstream system is down for a few hours, you can call Process.GetCurrentProcess().Kill(); in your exception handling. This stops the checkpoint from moving forward. I have tested this with a consumption-based Function App. You will not see anything in the logs, but I added an email notification so we know something went wrong; to avoid data loss I killed the function instance.
Hope this helps.
I plan to write a blog post about this, and about the other part of the workflow where I stop the Function via a Logic App in case of continuous failures against the downstream system.

Requeue or delete messages in Azure Storage Queues via WebJobs

I was hoping someone could clarify a few things regarding Azure Storage Queues and their interaction with WebJobs:
To perform recurring background tasks (i.e. add to queue once, then repeat at set intervals), is there a way to update the same message delivered in the QueueTrigger function so that its lease (visibility) can be extended as a way to requeue and avoid expiry?
With the above-mentioned pattern for recurring background jobs, I'm also trying to figure out a way to delete/expire a job 'on demand'. Since this doesn't seem possible outside the context of WebJobs, I was thinking of maybe storing the messageId and popReceipt for the message(s) to be deleted in Table storage as persistent cache, and then upon delivery of message in the QueueTrigger function do a Table lookup to perform a DeleteMessage, so that the message is not repeated any more.
Any suggestions or tips are appreciated. Cheers :)
Azure Storage Queues are used to store messages that may be consumed by your Azure Webjob, WorkerRole, etc. The Azure Webjobs SDK provides an easy way to interact with Azure Storage (that includes Queues, Table Storage, Blobs, and Service Bus). That being said, you can also have an Azure Webjob that does not use the Webjobs SDK and does not interact with Azure Storage. In fact, I do run a Webjob that interacts with a SQL Azure database.
I'll briefly explain how the Webjobs SDK interacts with Azure Queues. Once a message arrives in a queue (or is made 'visible'; more on this later), the function in the Webjob is triggered (assuming you're running in continuous mode). If that function returns with no error, the message is deleted. If something goes wrong, the message goes back to the queue to be processed again. You can handle the failed message accordingly. Here is an example on how to do this.
The SDK will call a function up to 5 times to process a queue message. If the fifth try fails, the message is moved to a poison queue. The maximum number of retries is configurable.
Regarding visibility: when you add a message to the queue, there is a visibility timeout property. By default it is zero. Therefore, if you want to process a message in the future, you can do so (up to 7 days in the future) by setting this property to the desired value.
Optional. If specified, the request must be made using an x-ms-version of 2011-08-18 or newer. If not specified, the default value is 0. Specifies the new visibility timeout value, in seconds, relative to server time. The new value must be larger than or equal to 0, and cannot be larger than 7 days. The visibility timeout of a message cannot be set to a value later than the expiry time. visibilitytimeout should be set to a value smaller than the time-to-live value.
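For illustration, enqueueing a message for future processing with the current Azure.Storage.Queues SDK might look like this (the queue name and timings are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;

public static class DelayedEnqueue
{
    public static async Task EnqueueAsync(string connectionString)
    {
        var queue = new QueueClient(connectionString, "tasks");
        await queue.CreateIfNotExistsAsync();

        // The message becomes visible to consumers only after one hour.
        // visibilityTimeout must be less than timeToLive and at most 7 days.
        await queue.SendMessageAsync(
            "do-this-later",
            visibilityTimeout: TimeSpan.FromHours(1),
            timeToLive: TimeSpan.FromDays(2));
    }
}
```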
Now the suggestions for your app.
I would just add a message to the queue for every task that you want to accomplish. The message will obviously have the pertinent information for processing. If you need to schedule several tasks, you can run a Scheduled Webjob (on a schedule of your choice) that adds messages to the queue. Then your continuous Webjob will pick up that message and process it.
Add a GUID to each message that goes to the queue. Store that GUID in some other domain of your application (a database). So when you dequeue the message for processing, the first thing you do is check against your database whether the message still needs to be processed. If you need to cancel the execution of a message, instead of deleting it from the queue, just update that GUID's record in your database.
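A hedged sketch of that second suggestion; the TaskMessage shape and TaskStateStore lookup are hypothetical placeholders, and the WebJobs SDK binds the JSON message body to the POCO:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public class TaskMessage
{
    public Guid Id { get; set; }
    public string Payload { get; set; }
}

public static class TaskStateStore
{
    // Stub: replace with a real database lookup.
    public static Task<bool> IsCancelledAsync(Guid id) => Task.FromResult(false);
}

public static class TaskWorker
{
    public static async Task ProcessQueueMessage(
        [QueueTrigger("tasks")] TaskMessage msg, ILogger log)
    {
        // First thing: ask the database whether this task was cancelled.
        if (await TaskStateStore.IsCancelledAsync(msg.Id))
        {
            log.LogInformation("Task {Id} cancelled; dropping message.", msg.Id);
            return; // returning normally deletes the queue message
        }

        await HandleAsync(msg); // your business logic (placeholder)
    }

    private static Task HandleAsync(TaskMessage msg) => Task.CompletedTask; // stub
}
```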
There's more info here.
Hope this helps,
As for the first part of the question, you can use the Update Message operation to extend the visibility timeout of a message.
The Update Message operation can be used to continually extend the invisibility of a queue message. This functionality can be useful if you want a worker role to "lease" a queue message. For example, if a worker role calls Get Messages and recognizes that it needs more time to process a message, it can continually extend the message's invisibility until it is processed. If the worker role were to fail during processing, eventually the message would become visible again and another worker role could process it.
You can check the REST API documentation here: https://msdn.microsoft.com/en-us/library/azure/hh452234.aspx
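With the current Azure.Storage.Queues SDK, extending the "lease" might look like this (the queue name is a placeholder, and the sketch assumes a message is available):

```csharp
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

public static class LeaseExtension
{
    public static async Task ExtendAsync(string connectionString)
    {
        var queue = new QueueClient(connectionString, "tasks");

        // Assumes at least one message is waiting on the queue.
        QueueMessage msg = (await queue.ReceiveMessagesAsync(maxMessages: 1)).Value[0];

        // Need more time? Push the visibility timeout out another 30 seconds.
        Response<UpdateReceipt> update = await queue.UpdateMessageAsync(
            msg.MessageId, msg.PopReceipt, visibilityTimeout: TimeSpan.FromSeconds(30));

        // The returned pop receipt replaces the old one for any further
        // Update/Delete calls on this message.
        string freshReceipt = update.Value.PopReceipt;
    }
}
```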
For the second part of your question, there are really multiple ways, and your method of storing the id/popReceipt as a lookup is a possible option. You could also have a WebJob dedicated to receiving messages on a different queue (e.g. plz-delete-msg): you send it a message containing the messageId, and that WebJob can use the Get Messages operation and then delete the matching message. (You can make the job generic by including the queue name in the message!)
https://msdn.microsoft.com/en-us/library/azure/dd179474.aspx
https://msdn.microsoft.com/en-us/library/azure/dd179347.aspx
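A hedged sketch of that dedicated delete job; the DeleteRequest shape and queue names are hypothetical. Since Get Messages returns whatever happens to be next, the worker scans a batch for the matching id, briefly hiding the other messages:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;
using Microsoft.Azure.WebJobs;

public class DeleteRequest
{
    public string QueueName { get; set; }
    public string MessageId { get; set; }
}

public static class DeleteWorker
{
    public static async Task Run([QueueTrigger("plz-delete-msg")] DeleteRequest req)
    {
        var target = new QueueClient(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"), req.QueueName);

        // Receiving yields the fresh pop receipt that Delete Message requires;
        // non-matching messages simply reappear after their visibility timeout.
        QueueMessage[] batch = (await target.ReceiveMessagesAsync(maxMessages: 32)).Value;
        foreach (QueueMessage m in batch)
        {
            if (m.MessageId == req.MessageId)
            {
                await target.DeleteMessageAsync(m.MessageId, m.PopReceipt);
            }
        }
    }
}
```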

Using Azure Queues as a State Machine

I'd like to use Azure Queues as a state machine for a high-load/high-scale web service.
The client would submit a request to a web service endpoint, at which point i'd return a request id.
I'd then submit the message to a queue so that a worker role can process it, but no database activity occurs during the submission process. Instead, I want to use the queue that the message lives in to represent its current state.
My problem is that if a worker role grabs the message off the queue to process it, it becomes invisible on that queue. If I want to check the status of the processing of that message, I have an ambiguous message state. Either the message was lost/never received, or it's in the queue but invisible because it's being processed.
Ideally, I'd like to be able to peek at the invisible message. If I find one that matches the request id, I know it's being processed if it's invisible, or waiting to be processed if it's visible. Obviously, I know when it's completed processing because that operation will result in a database write.
So is this possible, or does the fact that I can't peek at invisible messages in an Azure queue make this a no?
Windows Azure Storage Queues are for message-passing. They're not going to help you with state-machine processing, especially since each message is delivered at least once (i.e. possibly more than once): an app can run into an unexpected exception while processing a message, the VM instance could crash, etc., and then the queue message re-appears after the visibility timeout, now potentially out of order with the rest of your messages.
You're better off using an Azure Table row (or SQL table row).
In this case, I'd recommend using a blob to store the status of the message. Whenever a worker picks up a message, the blob ID could be included in it, and the worker can update the status blob. Your out-of-band process/website/whatever can then query the blob to gather status information.
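A small sketch of the status-blob idea with the current Azure.Storage.Blobs SDK (the container name and status values are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

public static class JobStatus
{
    // Write the current state to a blob named after the request id, so any
    // out-of-band reader can GET job-status/{requestId} to check progress.
    public static async Task SetAsync(string connectionString, string requestId, string status)
    {
        BlobContainerClient container = new BlobContainerClient(connectionString, "job-status");
        await container.CreateIfNotExistsAsync();
        await container.GetBlobClient(requestId)
            .UploadAsync(BinaryData.FromString(status), overwrite: true);
    }
}
```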
