How to prevent losing telemetry events with Application Insights' Persistence Channel?

I have integrated Microsoft Application Insights into my Windows Forms app. The document Application Insights on Windows Desktop apps, services and worker roles uses the default in-memory channel, and after flushing, the application sleeps for one second before exiting:
tc.Flush(); // only for desktop apps
// Allow time for flushing:
System.Threading.Thread.Sleep(1000);
The document states:
Note that Flush() is synchronous for the persistence channel, but asynchronous for other channels.
As this example is using the in-memory channel, I can deduce that flushing in the code example is asynchronous, hence the sleep.
In my code I'm using the persistence channel. Just before exiting my program I'm raising an event Application Shutdown:
static void Main(string[] args)
{
    try { /* application code */ }
    finally
    {
        Telemetry.Instance.TrackEvent("Application Shutdown");
        Telemetry.Instance.Flush();
        System.Threading.Thread.Sleep(1000); // allow time for flushing
    }
}
According to the documentation, Flush is synchronous for the persistence channel, so the sleep should not be needed before the application exits. Looking at the events arriving in the Azure portal, though, I can see that for most users the Application Shutdown event never arrives in the cloud. Stepping over Flush in the debugger, I also cannot perceive any delay.
I'm sure that I use persistence channel because I can see data is buffered in %LOCALAPPDATA%\Microsoft\ApplicationInsights.
My question is:
Since the persistence channel's Flush clearly is synchronous, what could be the reason that the last events of every application run are not showing up in Azure?

If I remember correctly, Flush() synchronously writes the remaining telemetry to the buffer (%LOCALAPPDATA% in the case of the persistence channel), but it does not initiate any delivery action. I would expect this telemetry to show up later, on the next application start, provided the buffer location does not change, because Application Insights will read the buffered data and send it out.
I might be mistaken here; the logic behind this could have been changed a while ago.
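If you want to keep using the persistence channel, a minimal sketch of the shutdown path follows. The channel type is assumed to come from the Microsoft.ApplicationInsights.PersistenceChannel NuGet package (verify against the version you ship); the short sleep only gives the background transmitter a chance to deliver before the process dies, and anything left on disk should be sent on the next run.
using System.Threading;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Channel;          // PersistenceChannel (assumed package)
using Microsoft.ApplicationInsights.Extensibility;

static void Main(string[] args)
{
    // Route telemetry through the persistence channel so unsent items
    // survive in %LOCALAPPDATA%\Microsoft\ApplicationInsights between runs.
    TelemetryConfiguration.Active.TelemetryChannel = new PersistenceChannel();
    var client = new TelemetryClient();

    try { /* application code */ }
    finally
    {
        client.TrackEvent("Application Shutdown");
        client.Flush();      // synchronous write to the on-disk buffer
        Thread.Sleep(1000);  // give the sender a moment to transmit; leftovers go out on the next start
    }
}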

Related

EventHub data bursty with long pauses

I'm seeing multi-second pauses in the event stream, even reading from the retention pool.
Here's the main nugget of EH setup:
BlobContainerClient storageClient = new BlobContainerClient(blobcon, BLOB_NAME);
RTMTest.eventProcessor = new EventProcessorClient(storageClient, consumerGroup, ehubcon, EVENTHUB_NAME);
And then the do nothing processor:
static async Task processEventHandler(ProcessEventArgs eventArgs)
{
    RTMTest.eventsPerSecond++;
    RTMTest.eventCount++;
    if ((RTMTest.eventCount % 16) == 0)
    {
        await eventArgs.UpdateCheckpointAsync(eventArgs.CancellationToken);
    }
}
And then a typical execution:
15:02:23: no events
15:02:24: no events
15:02:25: reqs=643
15:02:26: reqs=656
15:02:27: reqs=1280
15:02:28: reqs=2221
15:02:29: no events
15:02:30: no events
15:02:31: no events
15:02:32: no events
15:02:33: no events
15:02:34: no events
15:02:35: no events
15:02:36: no events
15:02:37: no events
15:02:38: no events
15:02:39: no events
15:02:40: no events
15:02:41: no events
15:02:42: no events
15:02:43: no events
15:02:44: reqs=3027
15:02:45: reqs=3440
15:02:47: reqs=4320
15:02:48: reqs=9232
15:02:49: reqs=4064
15:02:50: reqs=395
15:02:51: no events
15:02:52: no events
15:02:53: no events
The event hub, blob storage, and RTMTest webjob are all in US West 2. The event hub has 16 partitions. It's correctly calling my handler, as evidenced by the bursts of data. The error handler is not called.
Here are two applications side by side: the left using Redis, the right using Event Hubs. The events drive the animations, so you can visually watch the long stalls. Note: these are vaccines being reported around the US, either live or via batch reconciliations from the pharmacies.
vaccine reporting animations
Any idea why I see the multi-second stalls?
Thanks.
Event Hubs consumers make use of a prefetch queue when reading. This is essentially a local cache of events that the consumer tries to keep full by streaming in continually from the service. To prioritize throughput and avoid waiting on the network, consumers read exclusively from prefetch.
The pattern that you're describing falls into the "many smaller events" category, which will often drain the prefetch quickly if event processing is also quick. If your application is reading more quickly than the prefetch can refill, reads will start to take longer and return fewer events, as they wait on network operations.
One thing that may help is to test using higher values for PrefetchCount and CacheEventCount in the options when creating your processor. These default to a prefetch of 300 and a cache event count of 100. You may want to try testing with something like 750/250 and see what happens. We recommend keeping at least a 3:1 ratio.
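As a rough sketch against the setup shown in the question (the option names exist on EventProcessorClientOptions; the 750/250 values are just a starting point to experiment with):
BlobContainerClient storageClient = new BlobContainerClient(blobcon, BLOB_NAME);

// Larger prefetch and cache sizes help keep the local buffer from draining
// when the handler is very fast; tune from here, keeping roughly a 3:1 ratio.
var options = new EventProcessorClientOptions
{
    PrefetchCount = 750,    // default is 300
    CacheEventCount = 250   // default is 100
};

RTMTest.eventProcessor = new EventProcessorClient(
    storageClient, consumerGroup, ehubcon, EVENTHUB_NAME, options);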
It is also possible that your processor is being asked to do more work than is recommended for consistent performance across all partitions it owns. There's good discussion of different behaviors in the Troubleshooting Guide, and ultimately, capturing a +/- 5-minute slice of the SDK logs described here would give us the best view of what's going on. That's more detail and requires more back-and-forth discussion than works well on StackOverflow; I'd invite you to open an issue in the Azure SDK repository if you go down that path.
Something to keep in mind is that Event Hubs is optimized to maximize overall throughput and not for minimizing latency for individual events. The service offers no SLA for the time between when an event is received by the service and when it becomes available to be read from a partition.
When the service receives an event, it acknowledges receipt to the publisher and the send call completes. At this point, the event still needs to be committed to a partition. Until that process is complete, it isn't available to be read. Normally, this takes milliseconds but may occasionally take longer for the Standard tier because it is a shared instance. Transient failures, such as a partition node being rebooted/migrated, can also impact this.
With your near-real-time reading, you may be processing quickly enough that there's nothing client-side that will help. In that case, you'd need to consider adding more TUs, moving to a Premium/Dedicated tier, or using more partitions to increase concurrency.
Update:
For those interested without access to the chat, log analysis shows a pattern of errors that indicates that either the host owns too many partitions and load balancing is unhealthy or there is a rogue processor running in the same consumer group but not using the same storage container.
In either case, partition ownership is bouncing frequently causing them to stop, move to a new host, reinitialize, and restart - only to stop and have to move again.
I've suggested reading through the Troubleshooting Guide, as this scenario and some of the other symptoms are discussed there in detail.
I've also suggested reading through the samples for the processor - particularly Event Processor Configuration and Event Processor Handlers. Each has guidance around processor use and configuration that should be followed to maximize throughput.
#jesse very patiently examined my logs and led me to the "duh" moment of realizing I just needed a separate consumer group for this 2nd application of the EventHub data. Now things are rock solid. Thanks Jesse!

Flush() in Azure App Insights

For Flush() method in Azure App Insights, I was wondering if it impacts the performance of the project?
I tried removing Flush() and all the custom data is still sent to App Insights. So my question is: why do we need Flush()? Can we remove it?
Flush() on TelemetryClient pushes all the data it currently has in a buffer to the App Insights service.
You can see its source code here: https://github.com/Microsoft/ApplicationInsights-dotnet/blob/3115fe1cc866a15d09e9b5f1f7f596385406433d/src/Microsoft.ApplicationInsights/TelemetryClient.cs#L593.
Normally, Application Insights will send your data in batches in the background so it uses the network more efficiently.
If you have developer mode enabled or call Flush() manually, data is sent immediately.
Typically you do not need to call Flush().
But in a case where you know the process will exit after that point, you'll want to call Flush() to make sure all the data is sent.
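For example, at the tail end of a short-lived process (a minimal sketch; with the default in-memory channel Flush only initiates the send, so a brief delay before exit is still advisable):
var client = new TelemetryClient();
client.TrackEvent("JobFinished");

client.Flush();                       // push everything still buffered
System.Threading.Thread.Sleep(1000);  // give the async send time to complete before the process exits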

How does ILogger log to Azure Application Insights?

In an Azure Function, when you enable telemetry to Application Insights and fire a (for example) logger.LogInformation call (where logger is an ILogger instance), does it send it to the Application Insights instance asynchronously (i.e. non-blocking), synchronously (blocking), or through a local log that gets drained asynchronously?
Generally, the logger would be hooked up to turn log calls into the various trackMessage or related calls in the Application Insights SDK. Those messages get batched up on the AI side, and then sent after a threshold count of messages has been met, or after a certain amount of time has elapsed. The calls into Application Insights are all non-blocking, and will not throw exceptions (you don't want telemetry to negatively affect your real app!).
The C# SDKs that Azure Functions would use are here: https://github.com/Microsoft/ApplicationInsights-dotnet/
I said "generally" at the top because all this depends on how the SDK is configured, and that is up to the Azure Functions underlying code. Their GitHub repo with info is here: https://github.com/Azure/Azure-Functions, and they have a specific wiki set up with AI info as well, here: https://github.com/Azure/Azure-Functions/wiki/App-Insights
This appears to be the relevant code for specifically how data is sent to Application Insights:
https://github.com/Microsoft/ApplicationInsights-dotnet/tree/develop/src/Microsoft.ApplicationInsights/Channel
The ILogger wraps a TelemetryClient, which sends data to an ITelemetryChannel.
The InMemoryTelemetryChannel contains the logic for how data is pooled and sent to Application Insights. As John mentioned, the channel uses a "buffer" for storing data that hasn't been sent. The buffer is flushed and the data sent asynchronously to the Azure portal when either the buffer is full or a specific time interval (30 seconds) has elapsed.
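Outside of Functions you can see the same pipeline by wiring ILogger to Application Insights yourself; a minimal sketch, assuming the Microsoft.Extensions.Logging.ApplicationInsights provider package (the instrumentation key is a placeholder):
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

var services = new ServiceCollection();
services.AddLogging(builder =>
{
    // Provider from the Microsoft.Extensions.Logging.ApplicationInsights package (assumed).
    builder.AddApplicationInsights("<instrumentation key>");
});

using var provider = services.BuildServiceProvider();
var logger = provider.GetRequiredService<ILoggerFactory>().CreateLogger("Demo");

// Returns immediately; the telemetry is buffered by the channel and
// sent to Application Insights in batches in the background.
logger.LogInformation("Hello from ILogger");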

NServiceBus and Azure long running handler pattern

We are using Azure service bus via NServiceBus and I am facing a problem with deciding the correct architecture for dealing with long running tasks as a result of messages.
As is good practice, we don't want to block the message handler from returning by making it wait for long running processes (downloading a large file from a remote server), and actually doing so will cause the lock on the message to be lost with Azure SB. The plan is to respond by spawning a separate task and allow the message handler to return immediately.
However this means that the handler is now immediately available for the next message which will cause another task to be spawned and so on until the message queue is empty. What I'd like is some way to stop taking messages while we are processing (a limited number of) earlier messages. Is there an accepted pattern for this with NServiceBus and Azure Service Bus?
The following is roughly what I'd do if I were programming directly against Azure Service Bus:
while (true)
{
    var message = bus.Next();
    message.Complete();
    // Do long running stuff here
}
The verbs Next and Complete are probably wrong, but what happens under Azure is that Next gets a temporary lock on the message so that other consumers can no longer see it. Then you can decide if you really want to process the message and, if so, call Complete. That removes the message from the queue entirely; failing to do so causes the message to reappear on the queue after a period of time, as Azure assumes you crashed. As dirty as this code looks, it would achieve my goals (so why not do it?), since my consumer will only take the next message once I'm available again (after the long running task). Other consumers (other instances) can jump in if necessary.
The problem is that NServiceBus adds a level of abstraction so that now handling a message is via a method on a handler class.
void Handle(NewFileMessage message)
{
    // Do work here
}
The problem is that Azure does not get the call to message.Complete() until after your work is done and the Handle method exits. This is why you need to keep the work short. However, if you exit, you also signal that you are ready to handle another message. This is my catch-22.
Downloading on a background thread is a good idea. You don't want to increase the lock duration, because that's a symptom, not the problem. Your download can easily take longer than the maximum lock duration (5 minutes), and then you're back to square one.
What you can do is have an orchestrating saga for the download. The saga can monitor the download process, and when the download is completed, the background process signals completion to the saga. If the download never finishes, you can have a timeout (or multiple timeouts) to indicate that, and take a compensating action or retry, whatever works for your business case.
Documentation on Sagas should get you going: http://docs.particular.net/nservicebus/sagas/
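A rough sketch of such a saga follows. The message types and the FileId correlation property are made up for illustration; the saga base class, mapping, and timeout API are standard NServiceBus, but check the docs above for the exact shape in your version.
using System;
using System.Threading.Tasks;
using NServiceBus;

// Illustrative message types (not from the question).
public class NewFileMessage : ICommand { public string FileId { get; set; } }
public class StartDownload : ICommand { public string FileId { get; set; } }
public class DownloadCompleted : IMessage { public string FileId { get; set; } }
public class DownloadTimedOut { }

public class DownloadSagaData : ContainSagaData
{
    public string FileId { get; set; }
}

public class DownloadPolicy : Saga<DownloadSagaData>,
    IAmStartedByMessages<NewFileMessage>,
    IHandleMessages<DownloadCompleted>,
    IHandleTimeouts<DownloadTimedOut>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<DownloadSagaData> mapper)
    {
        mapper.ConfigureMapping<NewFileMessage>(m => m.FileId).ToSaga(s => s.FileId);
        mapper.ConfigureMapping<DownloadCompleted>(m => m.FileId).ToSaga(s => s.FileId);
    }

    public async Task Handle(NewFileMessage message, IMessageHandlerContext context)
    {
        Data.FileId = message.FileId;

        // Hand the long-running download to a background worker/endpoint and
        // return immediately so the incoming message completes within its lock.
        await context.Send(new StartDownload { FileId = message.FileId });

        // If no completion signal arrives in time, the Timeout method fires.
        await RequestTimeout<DownloadTimedOut>(context, TimeSpan.FromMinutes(30));
    }

    public Task Handle(DownloadCompleted message, IMessageHandlerContext context)
    {
        MarkAsComplete();
        return Task.CompletedTask;
    }

    public Task Timeout(DownloadTimedOut state, IMessageHandlerContext context)
    {
        // Compensate or retry; here we simply ask for the download again.
        return context.Send(new StartDownload { FileId = Data.FileId });
    }
}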
In Azure Service Bus you can increase the lock duration of a message (default set to 30 seconds) in case the handling will take a long time.
But even though you are able to increase the lock duration, needing to do so is generally an indication that your handler is taking care of too much work, which could be divided over different handlers.
If it is critical that the file is downloaded, I would keep the download operation in the handler. That way if the download fails the message can be handled again and the download can be retried. If however you want to free up the handler instantly to handle more messages, I would suggest that you scale out the workers that perform the download task so that the system can cope with the demand.
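If the main worry is how many messages are being worked on at once, NServiceBus can also cap the endpoint's processing concurrency, so the transport simply stops handing the handler more messages past the limit. A minimal sketch (the endpoint name is a placeholder):
var endpointConfiguration = new EndpointConfiguration("FileDownloader");

// At most 4 messages are processed concurrently; the rest stay on the queue
// and remain available to other scaled-out instances.
endpointConfiguration.LimitMessageProcessingConcurrencyTo(4);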

Listen to Queue (Event Driven no polling) Service-Bus / Storage Queue

I'm trying to figure out how I can listen for events on a queue (specifically an enqueue event).
Say I have a console application and a Service Bus queue/topic: how can I connect to the queue and wait for a new message?
I'm trying to achieve this without while(true) and constant polling; I'm trying to do it more in a quiet-listener way, something like a socket that stays connected to the queue.
The reason I don't want to use polling is that I understand it floods the server with requests, and I need a solution that will work under heavy load.
Thank you.
I gave a very basic example for the simplicity of the question, but my real situation is a bit more complex:
I have a Web API that sends messages that need to be processed to a Worker Role using a Service Bus queue.
I somehow need to know when the Worker has processed the message.
I want the Worker to send a message to the queue alerting the Web API that the message has been processed, but now I need to make the Web API "sit" and wait for a response from the Worker, which leads me back to my question:
How can I listen to a queue constantly and without polling (because there are a lot of instances that would be polling, and that would create a lot of requests that are probably best avoided)?
Use the Azure WebJobs SDK - it has triggers to monitor queues and blobs with simple code like:
public static void Main()
{
    JobHost host = new JobHost();
    host.RunAndBlock();
}

public static void ProcessQueueMessage([QueueTrigger("webjobsqueue")] string inputText,
    [Blob("containername/blobname")] TextWriter writer)
{
    writer.WriteLine(inputText);
}
There is a great tutorial at What is the Azure WebJobs SDK.
I eventually used QueueClient.ReceiveAsync in order to wait for a message, passing a TimeSpan parameter.
BrokeredMessage msg = await subClient.ReceiveAsync(TimeSpan.FromMinutes(3));
Here is a nice article that explains large parts of Azure Service Bus: link, as well as Service Bus best practices.
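For completeness, a sketch of that wait-based receive with the older Microsoft.ServiceBus.Messaging SDK (the queue name and connection string are placeholders). The long server-side wait keeps the client parked on an open connection instead of tight-loop polling:
using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

static async Task ListenAsync(string connectionString)
{
    var queueClient = QueueClient.CreateFromConnectionString(connectionString, "responses");

    while (true)
    {
        // Returns when a message arrives or the 3-minute wait expires (null in that case).
        BrokeredMessage msg = await queueClient.ReceiveAsync(TimeSpan.FromMinutes(3));
        if (msg == null)
            continue;

        // ... handle the worker's response ...
        await msg.CompleteAsync();
    }
}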
