Slow down EventHubTrigger in Azure Function

I have a simple Azure function which:
takes an EventHubTrigger as input
internally writes some events to Azure Storage
During some parts of the day the average batch size is around 250+, which is great because I write fewer blocks to Azure Storage, but for most of the time the batch size is less than 10.
Is there any way to force the EventHubTrigger to wait until there are more than 50/100/200 messages to process, so I can reduce the number of append blocks in Azure Storage?

Update--
The host.json file contains settings that control Event Hub trigger behavior. See the host.json settings section for details regarding available settings.
You can specify a timestamp such that you receive events enqueued only after the given timestamp.
initialOffsetOptions/enqueuedTimeUtc: Specifies the enqueued time of the event in the stream from which to start processing. When initialOffsetOptions/type is configured as fromEnqueuedTime, this setting is mandatory. Supports time in any format supported by DateTime.Parse(), such as 2020-10-26T20:31Z. For clarity, you should also specify a timezone. When timezone isn't specified, Functions assumes the local timezone of the machine running the function app, which is UTC when running on Azure. For more information, see the EventProcessorOptions documentation.
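For illustration, a host.json along these lines would start processing only from a given enqueued time. Treat this as a sketch rather than a verified configuration: the exact layout depends on the Event Hubs extension version you are running.

{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "initialOffsetOptions": {
        "type": "fromEnqueuedTime",
        "enqueuedTimeUtc": "2020-10-26T20:31:00Z"
      }
    }
  }
}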
---
If I understand your ask correctly, you want to hold all the received events until a certain threshold is reached and then process them at once in a single Azure Function run.
To receive events in a batch, make the parameter type a string or EventData array, and set the binding configuration accordingly in the function.json file or the EventHubTrigger attribute. Set the function.json property "cardinality" to "many" in order to enable batching. If omitted or set to "one", a single message is passed to the function. In C#, this property is assigned automatically whenever the trigger parameter is an array type.
Note: When receiving in a batch you cannot bind to method parameters with event metadata; you must read the metadata from each EventData object.
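A minimal sketch of a batched trigger, assuming the in-process C# model; MyEventHub and EventHubConnection are placeholder names:

using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class BatchedFunction
{
    [FunctionName("BatchedFunction")]
    public static void Run(
        // Binding to an array infers cardinality "many", so events arrive in batches.
        [EventHubTrigger("MyEventHub", Connection = "EventHubConnection")] string[] events,
        ILogger log)
    {
        log.LogInformation($"Received a batch of {events.Length} events.");
        foreach (var message in events)
        {
            // Per the note above, per-event metadata is not bound here;
            // bind to EventData[] instead if you need it.
            log.LogInformation(message);
        }
    }
}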
Beyond that, I could not find an explicit way to do this. You would have to restructure your function, for example by using a timer trigger, or by buffering messages in queue storage and then writing them to a blob via a blob binding.

Related

Azure storage queue triggered function starts multiple times

All,
I have a storage-queue-triggered Azure Function. It loads various data into a database from files. I specify the input file in the message sent to the input queue.
However, when I send a message into the queue, my function starts multiple instances and each tries to insert the same file into the db. If I log msg.dequeue_count I see it rising.
What shall I do to start only one function per message? Please note I'd like to keep the ability to start multiple instances for multiple messages, to load different files in parallel.
This question was also asked here and the answer was to check out the chart comparing storage and service bus queues.
Bottom line is that storage queues offer 'at least once' delivery. If you want 'at most once', you should use Service Bus with the ReceiveAndDelete receive mode (PeekLock is still 'at least once', but its lock prevents concurrent processing of the same message).
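For illustration, here is a sketch using the current Azure.Messaging.ServiceBus SDK; the queue name and connection string are placeholders:

using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<connection-string>");
ServiceBusReceiver receiver = client.CreateReceiver("load-files", new ServiceBusReceiverOptions
{
    // The message is removed as soon as it is handed to this receiver,
    // which gives at-most-once delivery (no redelivery on failure).
    ReceiveMode = ServiceBusReceiveMode.ReceiveAndDelete
});

ServiceBusReceivedMessage message = await receiver.ReceiveMessageAsync();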

Is there a way to find out which partition the message is written to when using EventHub.SendAsync(EventData)?

Is there a way to find out which partition the message is written to when using EventHub.SendAsync(EventData) from the Azure Event Hub client SDK?
We intentionally do not provide a partition key so the Event Hub service can do its internal load balancing, but we want to find out which partition the data is eventually written to, for diagnosing issues with the end-to-end data flow.
Ivan's answer is correct in the context of the legacy SDK (Microsoft.Azure.EventHubs), but the current generation (Azure.Messaging.EventHubs) is slightly different. You don't mention a specific language, but conceptually the answer is the same across them. I'll use .NET to illustrate.
If you're not using a call that requires specifying a partition directly when reading events, then you'll always have access to an object that represents the partition an event was read from. For example, if you're using the EventHubConsumerClient method ReadEventsAsync to explore, you'll see a PartitionEvent whose Partition property tells you the partition that the Data was read from.
When using the EventProcessorClient, your ProcessEventAsync handler will be invoked with a set of ProcessEventArgs where the Partition property tells you the partition that the Data was read from.
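A minimal sketch with EventHubConsumerClient from Azure.Messaging.EventHubs; the connection string and hub name are placeholders:

using System;
using Azure.Messaging.EventHubs.Consumer;

await using var consumer = new EventHubConsumerClient(
    EventHubConsumerClient.DefaultConsumerGroupName,
    "<event-hub-connection-string>",
    "<event-hub-name>");

// Reads across all partitions until cancelled; each PartitionEvent
// carries the partition the event came from.
await foreach (PartitionEvent partitionEvent in consumer.ReadEventsAsync())
{
    Console.WriteLine(
        $"Partition {partitionEvent.Partition.PartitionId}: sequence {partitionEvent.Data.SequenceNumber}");
}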
There is no direct way to do this, but there are two workarounds.
1. Use Event Hubs Capture to store the incoming events, then check the events in the configured blob storage. When events are captured to blob storage, the path contains the partition id, so you can read it from there.
2. Use code. Create a new consumer group and follow this article to read events. In that section there is a method public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages); you can use the PartitionContext parameter to get the event's partition id (via context.PartitionId).
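As a sketch of that second workaround with the legacy Microsoft.Azure.EventHubs.Processor SDK, an IEventProcessor implementation receives the PartitionContext on every callback:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.EventHubs;
using Microsoft.Azure.EventHubs.Processor;

class PartitionLoggingProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context) => Task.CompletedTask;
    public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.CompletedTask;
    public Task ProcessErrorAsync(PartitionContext context, Exception error) => Task.CompletedTask;

    public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var eventData in messages)
        {
            // context.PartitionId tells you which partition the event landed on.
            Console.WriteLine(
                $"Sequence {eventData.SystemProperties.SequenceNumber} read from partition {context.PartitionId}");
        }
        return context.CheckpointAsync();
    }
}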

Available methods to monitor Azure Event Hub partitions' queue size

Kafka provides the capability to monitor the current offset and the latest offset. Similarly, does Azure Event Hub expose any API to continuously monitor a partition's current offset and latest available offset?
Extending the above answer, you can see the offset in two ways.
Print the offset in the log file where you are listening to the Event Hub,
e.g. using an Azure Function:
public static async Task Run(
    [EventHubTrigger("EventHubname", ConsumerGroup = "ConsumerGroupname", Connection = "EventHubConnection")] EventData eventMessage,
    [Inject] IService service,
    [Inject] ILog log)
{
    log.Info($"PartitionKey {eventMessage.PartitionKey}, Offset {eventMessage.Offset} and SequenceNumber {eventMessage.SequenceNumber}");
}
I am listening to the Event Hub with Azure Functions; the runtime maintains the offset per partition in blob storage (the azure-webjobs-eventhub container of the storage account associated with the function app).
Option 3 (Latest)
The offset is not the correct way to measure the depth of an Event Hub, especially when you want to check how many messages still need to be processed.
We are now using the Event Hub message SequenceNumber instead of the Offset. We created a TimerTrigger Azure Function: every 5 minutes it gets the LastEnqueuedSequenceNumber from the Event Hub and the checkpointed SequenceNumber for each partition from blob storage (the checkpoint location), then stores the difference in Application Insights customMetrics.
Application Insights then lets us pin the Event Hub depth information to an Azure dashboard and set up an alert.
Timer Trigger Code
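The original snippet is not reproduced here; as a rough sketch of the idea, assuming the Azure.Messaging.EventHubs SDK and a hypothetical ReadCheckpointSequenceNumberAsync helper that reads the checkpointed sequence number from blob storage:

using System;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Consumer;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class EventHubDepthMonitor
{
    [FunctionName("EventHubDepthMonitor")]
    public static async Task Run(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
    {
        await using var consumer = new EventHubConsumerClient(
            "$Default", "<event-hub-connection-string>", "<event-hub-name>");

        foreach (string partitionId in await consumer.GetPartitionIdsAsync())
        {
            PartitionProperties props = await consumer.GetPartitionPropertiesAsync(partitionId);
            long checkpointed = await ReadCheckpointSequenceNumberAsync(partitionId);

            // Depth = newest sequence number in the partition minus last processed.
            log.LogMetric("EventHubDepth", props.LastEnqueuedSequenceNumber - checkpointed);
        }
    }

    // Hypothetical helper (implementation not shown): reads the checkpointed
    // sequence number for a partition from the blob checkpoint store.
    private static Task<long> ReadCheckpointSequenceNumberAsync(string partitionId) =>
        throw new NotImplementedException();
}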
I hope this will help!
Looking at Features and terminology in Azure Event Hubs - Event consumers - Stream offsets:
An offset is the position of an event within a partition. You can think of an offset as a client-side cursor. The offset is a byte numbering of the event. This offset enables an event consumer (reader) to specify a point in the event stream from which they want to begin reading events. You can specify the offset as a timestamp or as an offset value. Consumers are responsible for storing their own offset values outside of the Event Hubs service. Within a partition, each event includes an offset.
And also under Common consumer tasks - Read events:
As events are sent to the client, each event data instance contains important metadata such as the offset and sequence number that are used to facilitate checkpointing on the event sequence.
There do not seem to be any built-in methods to monitor the offset; as the documentation states, consumers are responsible for tracking it themselves.

How to persist state in Azure Function (the cheap way)?

How can I persist a small amount of data between Azure Function executions? Like in a global variable? The function runs on a Timer Trigger.
I need to store the result of one Azure Function execution and use this as input of the next execution of the same function. What is the cheapest (not necessarily simplest) way of storing data between function executions?
(Currently I'm using the free amount of Azure Functions that everyone gets and now I'd like to save state in a similar free or cheap way.)
There are a couple options - I'd recommend that you store your state in a blob.
You could use a blob input binding to read global state for every execution, and a blob output binding to update that state.
You could also remove the timer trigger and use queues, with the state stored in the queue message and a visibility timeout on the message to set the schedule (i.e. the next execution time).
Finally, you could use a file on the file system, as it is shared across the function app.
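As a sketch of the blob-binding option above, assuming an in-process C# function and a state/counter.txt blob path:

using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class StatefulTimer
{
    [FunctionName("StatefulTimer")]
    public static void Run(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer,
        // Input binding reads the previous state (null on the first run).
        [Blob("state/counter.txt", FileAccess.Read)] string currentState,
        // Output binding overwrites the blob with the new state.
        [Blob("state/counter.txt", FileAccess.Write)] out string newState,
        ILogger log)
    {
        int count = int.TryParse(currentState, out var n) ? n : 0;
        newState = (count + 1).ToString();
        log.LogInformation($"Execution number {count + 1}");
    }
}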
If you can accept the possibility of data loss and only care at the instance level, you can:
maintain a static data structure
write to instance local storage
Durable entities are now available to handle persistence of state.
https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-entities?tabs=csharp
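For example, a minimal class-based durable entity following the pattern in that documentation (the names here are illustrative):

using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json;

[JsonObject(MemberSerialization.OptIn)]
public class Counter
{
    // The persisted state of the entity.
    [JsonProperty("value")]
    public int Value { get; set; }

    // Operations callable from orchestrations or entity clients.
    public void Add(int amount) => this.Value += amount;
    public int Get() => this.Value;

    [FunctionName(nameof(Counter))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<Counter>();
}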
This is an old thread, but it's worth sharing the new way to handle state in Azure Functions.
We now have the Durable Functions approach from Microsoft, which makes it easy to maintain function state. Please refer to the documentation below.
https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview

Requeue or delete messages in Azure Storage Queues via WebJobs

I was hoping someone could clarify a few things regarding Azure Storage Queues and their interaction with WebJobs:
To perform recurring background tasks (i.e. add to the queue once, then repeat at set intervals), is there a way to update the same message delivered to the QueueTrigger function so that its lease (visibility) can be extended, as a way to requeue it and avoid expiry?
With the above-mentioned pattern for recurring background jobs, I'm also trying to figure out a way to delete/expire a job 'on demand'. Since this doesn't seem possible outside the context of WebJobs, I was thinking of storing the messageId and popReceipt for the message(s) to be deleted in Table storage as a persistent cache, and then, upon delivery of the message in the QueueTrigger function, doing a Table lookup and calling DeleteMessage so that the message is not repeated any more.
Any suggestions or tips are appreciated. Cheers :)
Azure Storage Queues are used to store messages that may be consumed by your Azure Webjob, WorkerRole, etc. The Azure Webjobs SDK provides an easy way to interact with Azure Storage (including Queues, Table Storage, Blobs, and Service Bus). That being said, you can also have an Azure Webjob that does not use the Webjobs SDK and does not interact with Azure Storage. In fact, I run a Webjob that interacts with a SQL Azure database.
I'll briefly explain how the Webjobs SDK interacts with Azure Queues. Once a message arrives in a queue (or is made 'visible', more on this later), the function in the Webjob is triggered (assuming you're running in continuous mode). If that function returns without error, the message is deleted. If something goes wrong, the message goes back to the queue to be processed again. You can handle the failed message accordingly. Here is an example on how to do this.
The SDK will call a function up to 5 times to process a queue message. If the fifth try fails, the message is moved to a poison queue. The maximum number of retries is configurable.
Regarding visibility, when you add a message to the queue, there is a visibility timeout property. By default, it is zero. Therefore, if you want to process a message in the future, you can do so (up to 7 days out) by setting this property to the desired value.
Optional. If specified, the request must be made using an x-ms-version of 2011-08-18 or newer. If not specified, the default value is 0. Specifies the new visibility timeout value, in seconds, relative to server time. The new value must be larger than or equal to 0, and cannot be larger than 7 days. The visibility timeout of a message cannot be set to a value later than the expiry time. visibilitytimeout should be set to a value smaller than the time-to-live value.
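With the current Azure.Storage.Queues SDK, that looks roughly like this (names are placeholders):

using System;
using Azure.Storage.Queues;

var queue = new QueueClient("<storage-connection-string>", "tasks");
await queue.CreateIfNotExistsAsync();

// The message stays invisible for one hour before any consumer can see it.
await queue.SendMessageAsync("process-file-42", visibilityTimeout: TimeSpan.FromHours(1));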
Now the suggestions for your app.
I would just add a message to the queue for every task that you want to accomplish. The message will obviously have the pertinent information for processing. If you need to schedule several tasks, you can run a Scheduled Webjob (on a schedule of your choice) that adds messages to the queue. Then your continuous Webjob will pick up that message and process it.
Add a GUID to each message that goes to the queue. Store that GUID in some other domain of your application (a database). When you dequeue a message for processing, first check your database to see whether the message still needs to be processed. If you need to cancel the execution of a message, instead of deleting it from the queue, just flag its GUID in your database so the message is skipped when it arrives.
There's more info here.
Hope this helps,
As for the first part of the question, you can use the Update Message operation to extend the visibility timeout of a message.
The Update Message operation can be used to continually extend the invisibility of a queue message. This functionality can be useful if you want a worker role to “lease” a queue message. For example, if a worker role calls Get Messages and recognizes that it needs more time to process a message, it can continually extend the message’s invisibility until it is processed. If the worker role were to fail during processing, eventually the message would become visible again and another worker role could process it.
You can check the REST API documentation here: https://msdn.microsoft.com/en-us/library/azure/hh452234.aspx
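With the current Azure.Storage.Queues SDK, the same operation is exposed as UpdateMessageAsync; a sketch, with placeholder names:

using System;
using System.Linq;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

var queue = new QueueClient("<storage-connection-string>", "tasks");
QueueMessage msg = (await queue.ReceiveMessagesAsync(maxMessages: 1)).Value.First();

// Extend the "lease" by pushing the visibility timeout another 10 minutes out.
await queue.UpdateMessageAsync(msg.MessageId, msg.PopReceipt,
    visibilityTimeout: TimeSpan.FromMinutes(10));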
For the second part of your question, there are really multiple ways, and your method of storing the id/popReceipt as a lookup is a possible option. You could also have a WebJob dedicated to receiving messages on a different queue (e.g. plz-delete-msg): you send a message containing the messageId, and this WebJob uses the Get Messages operation and then deletes it (you can make the job generic by passing the queue name).
https://msdn.microsoft.com/en-us/library/azure/dd179474.aspx
https://msdn.microsoft.com/en-us/library/azure/dd179347.aspx