Azure Function monitor alert where execution count < 1 never triggered - azure

I have an Azure Function App with Azure Functions that I want to monitor individually with the following rule: if an Azure Function hasn't executed for N minutes, send out an email/notification.
I am wondering if this is possible with Application Insights Alerts, which do provide signal logic for the count on an individual Azure Function basis. But this count is never 0; in the graphs it appears that any count < 1 is not recorded as a number. It displays as --, as you can see in the graph for my test function below:
testfunction chart (don't have enough reputation to post images)
The peak on the chart is seen as a 3, but if I use the condition "Whenever the testfunction Count is Less than 1" then the alert is never triggered.
Changing the aggregation granularity doesn't really do much, since the signal logic doesn't ever seem to record a count of 0, or any count smaller than 1.
There are lots of (slightly) more inconvenient ways to do this type of monitoring, but it seemed very possible with the nice built-in Azure Application Insights Alerts and I'd like to use that if at all possible.
Am I trying to misuse Application Insights Alerts or is there something obvious that I'm not getting? I would think it should be possible to have monitoring rules based on a lack of executions.

You might have to do this with log/query alerts instead. If you're doing metric-based alerts, some of those don't send 0s as data, so if nothing happened during a time range, there are no 0s to alert on, since nothing is submitting 0, 0, 0, 0.
Instead, you'd create alerts based on queries: https://learn.microsoft.com/en-us/azure/azure-monitor/platform/alerts-unified-log
The doc has this exact scenario listed:
In some cases, you may want to create an alert in the absence of an event.
For example, a process may log regular events to indicate that it's working properly. If it doesn't log one of these events within a particular time period, then an alert should be created. In this case, you would set the threshold to less than 1. [emphasis added; this is your scenario, correct?]
Example of Number of Records type log alert
Consider a scenario where you want to know when your web-based app gives a response to users with code 500 (that is, Internal Server Error). You would create an alert rule with the following details:
Query: requests | where resultCode == "500"
Time period: 30 minutes
Alert frequency: 5 minutes
Threshold value: Greater than 0
In that example the query would effectively end up being something like requests | where timestamp > ago(30m) | where resultCode == "500", because of the time period set. (The query itself can then filter that time range/result set down however you want.)
So for yours, you'd probably just do requests with no where condition at all, whatever time period and frequency you have, and "less than one" as the threshold.
You could make much more complicated queries as well, to filter out test data, etc.
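For instance, a minimal sketch of such a query for a "number of results" alert on one function (the operation_Name filter is an assumption; check how your function's invocations actually show up in the requests table first):

requests
// keep only invocations of the function being monitored (name assumed)
| where operation_Name == "testfunction"

With the alert's time period set to your N minutes and the threshold set to less than 1, the alert fires when the query returns zero rows.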
One thing to watch out for: I believe log alerts will fire every time the frequency elapses. So if you had a requests < 1 alert set up for every 5 minutes, and your function had no calls for 2 hours, the alert is going to fire every 5 minutes, sending you 24 emails or so. Maybe you want that :)

Related

How can we change/check the frequency of evaluation of a budget Alert in Azure

I have set up a budget alert using the Azure Portal. I defined my budget as $400 and the alert threshold as $120, and I received the alert notification by mail. But my concern is that in the alert mail the actual value is $240, which is much more than $120. I want the alert to be triggered immediately when the value goes above the $120 threshold.
Is there any approach where I can change/check the frequency of evaluation of the defined budget alert?
Azure usage data is not updated in real-time
https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/understand-cost-mgt-data#cost-and-usage-data-updates-and-retention
I don't think there is an option to increase the evaluation frequency of the alert. Even if you could, the cost data would still be delayed.

Stream Analytics: How can I start and stop a TUMBLINGWINDOW aggregation job in order to reduce costs while still getting the same aggregation results?

Context
I have created a streaming job using the Azure portal which aggregates data using a day-wise TUMBLINGWINDOW. I have attached a code snippet below, modified from the docs, which shows similar logic.
SELECT
    DATEADD(day, -1, System.Timestamp()) AS WindowStart,
    System.Timestamp() AS WindowEnd,
    TollId,
    COUNT(*)
FROM Input TIMESTAMP BY EntryTime
GROUP BY TumblingWindow(day, 1), TollId
Now that the job has been running and I can see it producing output, I want to reduce costs, ideally via some sort of time scheduling, so that the job can still produce the same output without being on all the time.
The only real constraint is that the aggregated output at the end of each TUMBLINGWINDOW has to remain the same as if the job were running all the time (no impact of stop-starting on the output).
This then brings me to my question.
Update: 2022-02-28
Before going into the question: another thing that drove me here is that through the Azure portal you can manually start and stop a job, and when you start/restart a job you can set a custom start time for the job/query. With this level of control, say I have a job running and decide to stop it for the majority of the day, then turn it on at, say, 11:30pm each day with a custom start time of midnight of the current day. It would then be on for approximately 30 minutes before it outputs the results, yet (to my understanding) still produce the same aggregation results as if it had been on the whole day up until that point. The job could then be paused again at 00:30am the next day, staying paused for the majority of the day (1,380 minutes in total, until 11:30pm), at which point the same logic applies again.
This way it remains off for the majority of the day yet can still produce the same output for each day-wise window (correct me if I am wrong in my thinking). The only issue seems to be that someone would have to perform this manually, which drove me to the docs looking for a way to automate it.
Question
How can I start and stop a job in an automated fashion such that the required output would still remain intact but so that the job doesn't have to remain on all the time (like it currently is)?
Does the documentation linked above suffice given this context? If so, what are some possible arrangements for the N minutes (on) and M minutes (off) time variables for this to work?
Is this possible given that I want to aggregate on a one-day TUMBLINGWINDOW (whereby I want each window to start and end at midnight of each day, as per its default behaviour)?
Eg
Window start: 2022-02-20 00:00:00 Window end: 2022-02-21 00:00:00 (aggregation performed),
Window start: 2022-02-21 00:00:00 Window end: 2022-02-22 00:00:00 (aggregation performed),
Window start: 2022-02-22 00:00:00 Window end: 2022-02-23 00:00:00 (aggregation performed),
....so on
Thoughts
I found this documentation from Microsoft regarding auto-pausing jobs using a few methods.
However, I came across a paragraph (quoted below) which made me doubt whether this is reasonable in my particular use case (a 1-day TUMBLINGWINDOW, as described in my question section).
Note
There are downsides to auto-pausing a job. The main ones being the loss of the low latency /real time capabilities, and the potential risks from allowing the input event backlog to grow unsupervised while a job is paused. Auto-pausing should not be considered for most production scenarios running at scale.
Could this method still work for my scenario despite that note?
There are 3 ways to lower costs:
Downscale your job: you will have higher latency but a lower cost, up to the point where your job crashes because it runs out of memory over time and/or can't catch up with its backlog. Here you need to keep an eye on your metrics to make sure you can react before it's too late.
Going further, you can regroup multiple queries into a single job. This job most likely won't be aligned in partitions, so it won't be able to scale linearly (adding SUs is not guaranteed to give you better performance). Same comment as above, plus you need to remember that when you need to scale back up, you will probably have to break that job down into multiple jobs to again be able to scale in a linear fashion.
Finally, you can auto-pause a job, one way to implement that being explained in the doc you linked. I wrote that doc, and what I meant by that comment is that here again you are taking the risk of overloading the job if it can't run long enough to process the backlog of events. This is a risky proposition for most production scenarios.
But if you know what you are doing, and are closely monitoring the appropriate metrics (as explained in the doc), this is definitely something you should explore; a sketch of the start/stop calls involved follows below.
Finally, all of these approaches, including the auto-pause one, will deal with tumbling windows transparently for you.
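For concreteness, here is a minimal sketch of the start/stop calls such automation would drive, assuming the stream-analytics Azure CLI extension (job and resource group names are placeholders, and flag spellings may vary between extension versions, so check az stream-analytics job start --help):

# stop the job once the day's window has been emitted
az stream-analytics job stop --job-name MyAsaJob --resource-group MyRg

# restart later, resuming output from where the job last stopped so the
# tumbling window still covers the whole day
az stream-analytics job start --job-name MyAsaJob --resource-group MyRg --output-start-mode LastOutputEventTime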
Update: 2022-03-03 following comments here
Update: 2022-03-04 following comments there
There are 3 time dimensions here:
When the job is running or not: the wall clock
When the time window is expected to output results: Tumbling(day,1) -> 00:00AM every day, this is absolute (on the day, on the hour, on the minute...) and independent of the job start time below
What output you want produced from the job, via the job start time
Let's say you have the job running 24/7 for multiple months, and decide to stop it at noon (12:00PM) on the 1st day of March.
It already has generated an output for the last day of February, at 00:00AM Mar1.
You won't see a difference in output until the following day, 00:00AM Mar2, when you expect to see the daily window of Mar1, but it's not output because the job is stopped.
Let's start the job at 01:00AM Mar2 wall clock time. If you want the missing time window, you should either pick a start time of 'when last stopped' (noon the day before), or a custom time any time before 23:59 Mar1. What you are driving is the output window you want. Here you are telling ASA you want all the windows from that point onward.
ASA will then reload all the data it needs to generate that window (make sure the event hub has enough retention for that; we don't cache data between restarts in the job): Azure Stream Analytics will automatically look back at the data in the input source. For instance, if you start a job "Now" and your query uses a 5-minute Tumbling Window, Azure Stream Analytics will seek data from 5 minutes ago in the input. The first possible output event will have a timestamp equal to or greater than the current time, and ASA guarantees that all input events that may logically contribute to the output have been accounted for.

Tracking a counter value in application insights

I'm trying to use Application Insights to keep track of a counter of the number of active streams in my application. I have 2 goals to achieve:
Show the current (or at least recent) number of active streams in a dashboard
Activate a kind of warning if the number exceeds a certain limit.
These streams can be quite long lived, and sometimes brief. So the number can sometimes change say 100 times a second, and sometimes remain unchanged for many hours.
I have been trying to track this active streams count as an application insights metric.
I'm incrementing a counter in my application when a new stream opens, and decrementing it when one closes. On each change I use the telemetry client something like this:
var myMetric = myTelemetryClient.GetMetric("Metricname");
myMetric.TrackValue(myCount);
When I query my metric values with Kusto, I see that because of these clusters of activity within a 10-second period, my metric values get aggregated. For the purposes of my alarm I can live with that, as I can look at the max value of the aggregate. But I can't present a dashboard of the number of active streams, as I have no way of knowing the number of active streams between my measurement points. I know the min, max, and average values, but I don't know the last value of the aggregation period, and since it can be anywhere between 0 and 1000, it's no help.
So the solution I have doesn't serve my needs, I thought of a couple of changes:
Adding a scheduled pump to my counter component, which sends the current counter value once every, say, 5 minutes. But I don't like that I then have to add a thread for each of these counters. (A sketch of this option follows after this list.)
Adding a timer that sends the current value once, 5 minutes after the last change, with the countdown reset each time the counter changes. This has the same problem as above, and does an excessive amount of work resetting the countdown when the counter could be changing thousands of times a second.
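For what it's worth, a minimal sketch of option 1 under those assumptions (all names here are hypothetical; a single System.Threading.Timer flushes the current value every 5 minutes, so there is no dedicated thread per counter):

using System;
using System.Threading;
using Microsoft.ApplicationInsights;

public sealed class ActiveStreamGauge : IDisposable
{
    private readonly TelemetryClient _telemetry;
    private readonly Timer _pump;
    private int _count;

    public ActiveStreamGauge(TelemetryClient telemetry)
    {
        _telemetry = telemetry;
        // Flush the current value every 5 minutes, even if it hasn't changed.
        _pump = new Timer(_ => Flush(), null, TimeSpan.Zero, TimeSpan.FromMinutes(5));
    }

    public void Opened() => Interlocked.Increment(ref _count);
    public void Closed() => Interlocked.Decrement(ref _count);

    private void Flush() =>
        _telemetry.GetMetric("ActiveStreams").TrackValue(Volatile.Read(ref _count));

    public void Dispose() => _pump.Dispose();
}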
In the end, I don't think my needs are all that exotic, so I wonder if I'm using app insights incorrectly.
Is there some way I can change the metric's behavior to suit my purposes? I appreciate that it's pre-aggregating before sending data in order to reduce ingest costs, but it's preventing me from solving a simple problem.
Is a metric even the right way to do this? Are there alternative approaches within app insights?
You can use TrackMetric instead of the GetMetric ceremony to track individual values without aggregation. From the docs:
Microsoft.ApplicationInsights.TelemetryClient.TrackMetric is not the preferred method for sending metrics. Metrics should always be pre-aggregated across a time period before being sent. Use one of the GetMetric(..) overloads to get a metric object for accessing SDK pre-aggregation capabilities. If you are implementing your own pre-aggregation logic, you can use the TrackMetric() method to send the resulting aggregates.
But you can also use events as described next:
If your application requires sending a separate telemetry item at every occasion without aggregation across time, you likely have a use case for event telemetry; see TelemetryClient.TrackEvent (Microsoft.ApplicationInsights.DataContracts.EventTelemetry).
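As an illustration, a minimal sketch of both unaggregated options (the metric and event names are placeholders):

using System.Collections.Generic;
using Microsoft.ApplicationInsights;

public static class StreamTelemetry
{
    // TrackMetric sends one telemetry item per call, bypassing SDK pre-aggregation.
    public static void ReportCount(TelemetryClient telemetry, int count) =>
        telemetry.TrackMetric("ActiveStreams", count);

    // The event alternative: one event per change, with the count attached as a measurement.
    public static void ReportCountAsEvent(TelemetryClient telemetry, int count) =>
        telemetry.TrackEvent("StreamCountChanged",
            metrics: new Dictionary<string, double> { ["ActiveStreams"] = count });
}

Either way, every change becomes its own telemetry item, so the "last value" is always queryable, at the cost of higher ingestion volume.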

Azure Functions scaling and concurrency using Queue triggers and functionAppScaleLimit on the Consumption Plan

I have an Azure Function app on the Linux Consumption Plan that has two queue triggers. Both queue triggers have the batchSize parameter set to 1 because they can both use about 500 MB of memory each and I don't want to exceed the 1.5GB memory limit, so they should only be allowed to pick up one message at a time.
If I want to allow both of these queue triggers to run concurrently, but don't want them to scale beyond that, is setting the functionAppScaleLimit to 2 enough to achieve that?
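For reference, the batchSize in question lives in host.json; a minimal sketch with the values described above (note that newBatchThreshold also matters: the host fetches a new batch once in-flight messages drop below it, and for batchSize 1 its default is already 0):

{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 1,
      "newBatchThreshold": 0
    }
  }
}

functionAppScaleLimit itself is not a host.json setting; it is set on the site, or equivalently via the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT app setting, as the answer below notes.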
Edit: added new examples, thank you @Hury Shen for providing the framework for these examples
Please see @Hury Shen's answer below for more details. I've tested three queue trigger scenarios. All use the following legend:
QueueTrigger with no functionAppScaleLimit
QueueTrigger with functionAppScaleLimit set to 2
QueueTrigger with functionAppScaleLimit set to 1
For now, I think I'm going to stick with the last example, but in the future I think I can safely set my functionAppScaleLimit to 2 or 3 if I upgrade to the premium plan. I also am going to test two queue triggers that listen to different storage queues with a functionAppScaleLimit of 2, but I suspect the safest thing for me to do is to create separate Azure Function apps for each queue trigger in that scenario.
Edit 2: add examples for two queue triggers within one function app
Here are the results when using two queue triggers within one Azure Function that are listening on two different storage queues. This is the legend for both queue triggers:
Both queue triggers running concurrently with functionAppScaleLimit set to 2
Both queue triggers running concurrently with functionAppScaleLimit set to 1
In the example where two queue triggers are running concurrently with functionAppScaleLimit set to 2, it looks like the scale limit is not working. Can someone from Microsoft please explain? There is no warning in the official documentation (https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale#limit-scale-out) that this setting is in preview mode, yet we can clearly see that the Azure Function is scaling out to 4 instances when the limit is set to 2. In the following example, it looks like the limit is being respected, but the functionality is not what I want, and we still see the waiting that is present in @Hury Shen's answer.
Conclusion
To limit concurrency and control scaling in Azure Functions with queue triggers, you must limit your Azure Function to use one queue trigger per function app and use the batchSize and functionAppScaleLimit settings. You will encounter race conditions and waiting that may lead to timeouts if you use more than one queue trigger.
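For reference, a sketch of setting that limit from the CLI, mirroring the command in the linked doc (resource names are placeholders; verify the syntax against the current doc):

az resource update --resource-type Microsoft.Web/sites -g MyResourceGroup -n my-function-app/config/web --set properties.functionAppScaleLimit=2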
Yes, you just need to set functionAppScaleLimit to 2. But there are some mechanics of the consumption plan that you need to know about. I tested it on my side with batchSize set to 1 and functionAppScaleLimit set to 2 (I set WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT to 2 in the "Application settings" of the function app instead of setting functionAppScaleLimit; they are the same). And I tested with the code below:
import logging
import azure.functions as func
import time

def main(msg: func.QueueMessage) -> None:
    logging.info('=========sleep start')
    time.sleep(30)
    logging.info('=========sleep end')
    logging.info('Python queue trigger function processed a queue item: %s',
                 msg.get_body().decode('utf-8'))
Then I added messages to the queue; I sent 10 messages (111, 222, 333, 444, 555, 666, 777, 888, 999, 000), one by one. The function was triggered successfully, and after a few minutes we can see the logs in "Monitor". Clicking one of the logs in "Monitor", the logs show as:
I drew 4 red boxes on the right of the screenshot above and named each of the four logs "s1", "s2", "s3", "s4" (steps 1-4), then summarized the logs in Excel for your reference:
I made the cells from "s2" to "s4" yellow because this period of time refers to the execution time of the function task.
According to the Excel screenshot, we can infer the following points:
1. The maximum number of instances only extends to 2, because no line of the Excel table contains more than two yellow cells. So the function cannot scale beyond 2 instances, as you mentioned in your question.
2. You want to allow both of these queue triggers to run concurrently, and that can be implemented. But the instances scale out by the consumption plan's own mechanism. In simple terms, when one function instance has been triggered by one message and hasn't completed the task yet, and another message comes in, it is not guaranteed that another instance will be used. The second message might end up waiting on the first instance; we cannot control whether another instance is enabled or not.
===============================Update==============================
As I'm not so clear on your description, I'm not sure how many storage queues you want to listen to and how many function apps and QueueTrigger functions you have created on your side. I summarize my test results below for your reference:
1. For your question about whether the Maximum Burst you described on the premium plan would behave differently than this: I think if we choose the premium plan, the instances will also be scaled out with the same mechanism as the consumption plan.
2. If you have two storage queues that need to be listened to, of course we should create two QueueTrigger functions, one listening to each storage queue.
3. If you just have one storage queue that needs to be listened to, I tested three cases (with max scale instances set to 2 in all three):
A) Create one QueueTrigger function in one function app to listen to one storage queue. This is what I tested in my original answer; the Excel table shows us the instances scale out by the consumption plan's mechanism and we cannot control it.
B) Create two QueueTrigger functions in one function app to listen to the same storage queue. The result is almost the same as case A; we cannot control how many instances are used to deal with the messages.
C) Create two function apps, each with a QueueTrigger function listening to the same storage queue. The result is also similar to cases A and B; the difference is that the max instances can scale to 4, because I created two function apps (both of which can scale to 2 instances).
So, in a word, I think all three cases are similar. Even if we choose case C and create two function apps with one QueueTrigger function in each of them, we still cannot make sure the second message is dealt with immediately; it may still be routed to the first instance and wait for the first instance to finish dealing with the first message.
So the answer to your current question in this post, "is setting the functionAppScaleLimit to 2 enough to achieve that?", is: if you want both instances guaranteed to run concurrently, we can't make sure of that. If you just want at most two instances dealing with the messages, the answer is yes.

Azure Monitor Custom log search Query - understanding Period and Frequency

UPDATE:
The actual problem is different from what I've described. I'll provide an update/edit to this ticket once we resolve the issue. More details may be found in this thread - https://techcommunity.microsoft.com/t5/Azure-Log-Analytics/Reliably-trigger-alerts-for-Log-Analytics-log-entries/m-p/319315/highlight/false#M1224
Original question:
We use Azure Monitor to create alerts based on logs in Log Analytics. For this we choose our Log Analytics account as a "RESOURCE", then choose "Custom log search" signal name for "CONDITION". Alert logic - "Number of results greater than 0".
Sample query:
search *
| where ResourceProvider == "MICROSOFT.DATAFACTORY" and status_s == "Failed"
For Period and Frequency, let's set 15 minutes. All looks simple, but...
The issue: the setup described above does not work reliably (it works only sometimes); a lot of alerts are missed, which is completely unacceptable behavior.
If we set Period = Frequency = 5 minutes we miss almost every event. Period = Frequency = 15 minutes works better, but a lot of events are still missed. Period = Frequency = 30 works even better, but all of this looks weird.
An important notice: the logs are collected from Data Factory V2 into Log Analytics. I suspect that the alert misses are due to the fact that logs are delivered to Log Analytics with some delay (up to several minutes). So when Azure Monitor evaluates the alert query for the last 15 minutes (Period = 15), it may be that the most recent log entries are not yet in Log Analytics. When the next query evaluation executes 15 minutes later, it will miss the logs that were ingested with a delay for the previous 15-minute interval. Is this assumption correct? If so, this is very weird: how then are we supposed to configure the Period and Frequency values? If I set Period > Frequency (e.g. Period = 30 and Frequency = 5, which means "evaluate the expression every 5 minutes, taking data for the last 30 minutes from the current time"), then we get multiple duplicated alerts, because Period is larger than Frequency, so there is a big chance of the log search query returning the same log entries every 5 minutes. This is highly undesirable behavior.
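For illustration, with Period = 30 and Frequency = 5 each evaluation effectively runs something like this over the trailing 30 minutes (a sketch; the TimeGenerated window is applied by the alert's Period, not written into the saved query):

search *
| where ResourceProvider == "MICROSOFT.DATAFACTORY" and status_s == "Failed"
| where TimeGenerated > ago(30m)

So a single failure row can match up to six consecutive 5-minute evaluations, which is exactly where the duplicate alerts come from.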
The issue happened to be buggy behavior in the ARM template creating the alerts. Thanks to Stanislav Zhelyazkov it has been nailed down and resolved: I use an alternative ARM API now and it seems to work fine. More details on the topic may be found here - https://techcommunity.microsoft.com/t5/Azure-Log-Analytics/Reliably-trigger-alerts-for-Log-Analytics-log-entries/m-p/309610.
