Azure Logs does not have all data

I have an Azure Function, and in the invocation list I can see all of its calls. But when I go to "Logs" and try the following query:
traces
| project
timestamp,
message,
operation_Name,
operation_Id,
cloud_RoleName
| where cloud_RoleName =~ 'FunctionDeviceManager' and operation_Name =~ 'FunctionAlertServiceCallback'
| order by timestamp desc
| take 2000
I see the following result:
As we can see, many calls (for example, the ones with id 95ecc6d554d78fa34534813efb82abba, 29b613056e582666c132de6ff73b2c2e, and many others; in fact most of them) are not displayed in the result.
What is wrong?

The invocation log is not based on data in the traces collection. Instead, it is based on request data. You can easily verify this by choosing Run query in Application Insights, which runs this query:
requests
| project
timestamp,
id,
operation_Name,
success,
resultCode,
duration,
operation_Id,
cloud_RoleName,
invocationId=customDimensions['InvocationId']
| where timestamp > ago(30d)
| where cloud_RoleName =~ 'xxx' and operation_Name =~ 'yyy'
| order by timestamp desc
| take 20
So that explains the difference in the result.
Now, regarding why the traces collection doesn't always contain data related to a request: by default, all types of telemetry are subject to sampling unless specified otherwise in the host.json file, see the docs.
For example, when you create a new HTTP-triggered function using Visual Studio 2022, the following host.json is added:
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  }
}
As you can see, request telemetry is excluded from the types of telemetry being sampled. This can cause the issue you are experiencing: the requests collection is not sampled, but the traces collection is, so some data never shows up in the result of your query.
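To confirm that sampling is indeed the cause, a query along these lines (a sketch, reusing the role and operation names from the question) lists requests that have no matching rows in traces:
// sketch: requests with no corresponding traces (likely sampled out)
requests
| where timestamp > ago(1d)
| where cloud_RoleName =~ 'FunctionDeviceManager' and operation_Name =~ 'FunctionAlertServiceCallback'
| join kind=leftanti (
    traces
    | where timestamp > ago(1d)
    | project operation_Id
) on operation_Id
| project timestamp, operation_Id, resultCode
| order by timestamp desc
Any rows returned here are invocations whose request telemetry was kept while their trace telemetry was sampled away.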

Most likely this is the effect of sampling. Unless you have tweaked your Function App config in host.json, some executions are skipped in the logs. Per the MS documentation:
Application Insights has a sampling feature that can protect you from producing too much telemetry data on completed executions at times of peak load. When the rate of incoming executions exceeds a specified threshold, Application Insights starts to randomly ignore some of the incoming executions. The default setting for maximum number of executions per second is 20 (five in version 1.x). You can configure sampling in host.json. Here's an example:
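The example itself isn't quoted above; a representative host.json sampling configuration (illustrative values on my part, not the verbatim docs snippet) might look like this:
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "maxTelemetryItemsPerSecond": 20,
        "excludedTypes": "Request"
      }
    }
  }
}
Setting isEnabled to false (or raising maxTelemetryItemsPerSecond) keeps more traces at the cost of higher ingestion volume.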
See also: https://learn.microsoft.com/en-us/azure/azure-monitor/app/sampling

Related

How can I run a search job periodically in Azure Log Analytics?

I'm trying to visualize the browser statistics of our app hosted in Azure.
For that I'm using the nginx logs and running an Azure Log Analytics query like this:
ContainerLog
| where LogEntrySource == "stdout" and LogEntry has "nginx"
| extend logEntry=parse_json(LogEntry)
| extend userAgent=parse_user_agent(logEntry.nginx.http_user_agent, "browser")
| extend browser=parse_json(userAgent)
| summarize count=count() by tostring(browser.Browser.Family)
| sort by ['count']
| render piechart with (legend=hidden)
Then I'm getting this diagram, which is exactly what I want:
But the query is very very slow. If I set the time range to more than just the last few hours it takes several minutes or doesn't work at all.
My solution is to use a search job like this:
ContainerLog
| where LogEntrySource == "stdout" and LogEntry has "nginx"
| extend d=parse_json(LogEntry)
| extend user_agent=parse_user_agent(d.nginx.http_user_agent, "browser")
| extend browser=parse_json(user_agent)
It creates a new table BrowserStats_SRCH on which I can do this search query:
BrowserStats_SRCH
| summarize count=count() by tostring(browser.Browser.Family)
| sort by ['count']
| render piechart with (legend=hidden)
This is much faster now and only takes a few seconds.
But my problem is: how can I keep this up to date? Preferably this search job would run once a day automatically and refresh the BrowserStats_SRCH table, so that new queries on that table always run on the most recent logs. Is this possible? Right now I can't even trigger the search job manually again, because then I get the error "A destination table with this name already exists".
In the end I would like to have a deeplink to the pie chart with the browser stats without the need to do any further click. Any help would be appreciated.
But my problem is: how can I keep this up to date? Preferably this search job would run once a day automatically and refresh the BrowserStats_SRCH table, so that new queries on that table always run on the most recent logs. Is this possible?
You can leverage the API to create a search job, then use a timer-triggered Azure Function or Logic App to call that API on a schedule:
PUT https://management.azure.com/subscriptions/00000000-0000-0000-0000-00000000000/resourcegroups/testRG/providers/Microsoft.OperationalInsights/workspaces/testWS/tables/Syslog_suspected_SRCH?api-version=2021-12-01-preview
with a request body containing the query
{
  "properties": {
    "searchResults": {
      "query": "Syslog | where * has 'suspected.exe'",
      "limit": 1000,
      "startSearchTime": "2020-01-01T00:00:00Z",
      "endSearchTime": "2020-01-31T00:00:00Z"
    }
  }
}
Or you can use the Azure CLI:
az monitor log-analytics workspace table search-job create --subscription ContosoSID --resource-group ContosoRG --workspace-name ContosoWorkspace --name HeartbeatByIp_SRCH --search-query 'Heartbeat | where ComputerIP has "00.000.00.000"' --limit 1500 --start-search-time "2022-01-01T00:00:00.000Z" --end-search-time "2022-01-08T00:00:00.000Z" --no-wait
Right now I can't even trigger the search job manually again, because then I get the error "A destination table with this name already exists".
Before you start the job as described above, remove the old result table using an API call:
DELETE https://management.azure.com/subscriptions/{subscriptionId}/resourcegroups/{resourceGroupName}/providers/Microsoft.OperationalInsights/workspaces/{workspaceName}/tables/{tableName}?api-version=2021-12-01-preview
Optionally, you could check the status of the job using this API before you delete it, to make sure it is not InProgress or Deleting.
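Putting the pieces together, a scheduled refresh could look roughly like this. This is only a sketch reusing the names from the CLI example above; the show/delete commands are the standard workspace table commands, and reading the job state from provisioningState is an assumption on my part:
# Check the state of the previous search job's table (field path is an assumption)
az monitor log-analytics workspace table show --subscription ContosoSID --resource-group ContosoRG --workspace-name ContosoWorkspace --name HeartbeatByIp_SRCH --query "provisioningState"
# Remove the previous result table
az monitor log-analytics workspace table delete --subscription ContosoSID --resource-group ContosoRG --workspace-name ContosoWorkspace --name HeartbeatByIp_SRCH --yes
# Then re-run the search-job create command shown above with an updated time window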

Azure log ingestion TimeGenerated problem

I have an Azure Log Analytics workspace and inside it I created a custom table to ingest some of my logs.
I used these two guides for it (mainly the first one):
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/tutorial-logs-ingestion-portal
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/tutorial-logs-ingestion-api
In my logs I have a field:
"Time": "2023-02-07 11:15:23.926060"
Using DCR, I create a field TimeGenerated like this:
source
| extend TimeGenerated = todatetime(Time)
| project-away Time
Everything works fine, I manage to ingest my data and query it with KQL.
The problem is that I can't ingest data with an older timestamp. If the timestamp is the current time or close to it, it works fine.
If my timestamp is, let's say, from two days ago, it gets overwritten with the current time.
Example of the log I send:
{
  "Time": "2023-02-05 11:15:23.926060",
  "Source": "VM03",
  "Status": 1
}
The log I receive:
{
  "TimeGenerated": "2023-02-07 19:35:23.926060",
  "Source": "VM03",
  "Status": 1
}
Can you tell me why this is happening, why I can't ingest logs from several days ago, and how to fix it? The guides I used don't mention anything of the sort, regrettably.
I've hit this limit once before, a long, long time ago. I asked a question and got a response from someone working on Application Insights; the answer was that only data not older than 48h is ingested.
Nowadays, AFAIK, the same applies to Log Analytics. I am not sure the limit of 48 hours still stands, but I think it is fair to assume some limit is still enforced and there is no way around it.
Back then I took my loss and worked with recent data only.
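One way to see this behaviour in your own workspace is to compare the stored timestamp with the ingestion time. A sketch, where MyTable_CL stands in for your custom table name:
// sketch: compare the stored timestamp with the time the row was ingested
MyTable_CL
| extend IngestedAt = ingestion_time()
| project TimeGenerated, IngestedAt, Source, Status
| order by IngestedAt desc
If TimeGenerated tracks IngestedAt rather than the Time value you sent, the timestamp was rewritten during ingestion.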

How to properly create once-a-day Azure Log Alert for pod errors?

I created an Azure Alert using a query (KQL - Kusto Query Language) reading from the Log. That is, it's a Log Alert.
After a few minutes, the alert was triggered (as I expected based on my condition).
My condition checks if there are Pods of a specific name in Failed state:
KubePodInventory
| where TimeGenerated between (now(-24h) .. now())
| where ClusterName == 'mycluster'
| where Namespace == 'mynamespace'
| where Name startswith "myjobname"
| where PodStatus == 'Failed'
| where ContainerStatusReason == 'Completed' //or 'Error' or doesn't matter? (there are duplicated entries, one with Completed and one with Error)
| order by TimeGenerated desc
These errors stay in the log, and I only want to catch them (alert on them) once per day; that is, if there is at least one entry in the log (the threshold), fire the alert.
Is the log query evaluated every time there is a new entry in the log, or is it evaluated at a set frequency? I could not find a frequency for checking alerts specified anywhere in the Azure Portal, so maybe the alert condition is evaluated every time there is something new in the log?

How can I access custom event values from Azure AppInsights analytics?

I am reporting some custom events to Azure; within the custom event is a value named 'totalTime', held under the customMeasurements object.
The event itself looks like this:
loading-time: {
  customMeasurements: {
    totalTime: 123
  }
}
I'm trying to create a graph of the average total time of all the events reported to Azure per hour. So I need to be able to collect and average the values within the events.
I can't seem to figure out how to access the customMeasurements values from within Azure AppInsights Analytics. Here is some of the code that Azure provided:
union customEvents
| where timestamp between(datetime("2019-11-10T16:00:00.000Z")..datetime("2019-11-11T16:00:00.000Z"))
| where name == "loading-time"
| summarize Ocurrences=count() by bin(timestamp, 1h)
| order by timestamp asc
| render barchart
This code simply counts the number of reported events within the last 24 hours and displays them per hour.
I have tried to access the customMeasurements object held in the event by doing
summarize Occurrences=avg(customMeasurements["totalTime"])
But Azure doesn't like that, so I'm doing it wrong. How can I access the values I require? I can't seem to find any documentation either.
It can be useful to project the data from the customDimensions / customMeasurements property collection into a new variable that you'll use for further aggregation. You'll normally need to cast the dimensions data to the expected type, using one of the todecimal, toint, tostring functions.
For example, I have some extra measurements on dependency telemetry, so I can do something like this:
dependencies
| project ["ResponseCompletionTime"] = todecimal(customMeasurements.ResponseToCompletion), timestamp
| summarize avg(ResponseCompletionTime) by bin(timestamp, 1h)
Your query might look something like this:
customEvents
| where timestamp between(datetime("2019-11-10T16:00:00.000Z")..datetime("2019-11-11T16:00:00.000Z"))
| where name == "loading-time"
| project ["TotalTime"] = toint(customMeasurements.totalTime), timestamp
| summarize avg(TotalTime) by bin(timestamp, 1h)
| render barchart
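A slightly more compact variant (a sketch, assuming totalTime is always numeric) casts inside the summarize instead of projecting first:
// sketch: cast the custom measurement inline and average per hour
customEvents
| where timestamp between(datetime("2019-11-10T16:00:00.000Z")..datetime("2019-11-11T16:00:00.000Z"))
| where name == "loading-time"
| summarize AvgTotalTime = avg(todouble(customMeasurements["totalTime"])) by bin(timestamp, 1h)
| render barchart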

Azure Log Analytics - Alerts Advice

I have a question about Azure Log Analytics alerts: I don't quite understand how the time frame works in the context of setting up an alert based on an aggregated value.
I have the code below:
Event
| where Source == "EventLog" and EventID == 6008
| project TimeGenerated, Computer
| summarize AggregatedValue = count(TimeGenerated) by Computer, bin_at(TimeGenerated, 24h, datetime(now()))
For time window: 24/03/2019, 09:46:29 - 25/03/2019, 09:46:29
In the above, the alert configuration interface insists on adding the bin_at(TimeGenerated, 24h, datetime(now())) function, so I add it, passing the arguments for a 24h time period. If you are already adding this, then what is the point of the time frame?
Basically, the result I am looking for is capturing this event over a 24-hour period and alerting when the event count is over 2. I don't understand why a time window is also necessary on top of this, because I just want to run the code every five minutes and alert if it detects more than two instances of this event.
Can anyone help with this?
AFAIK you may use a query something like the one shown below to accomplish your requirement of capturing the required event over a time period of 24 hours.
Event
| where Source == "EventLog" and EventID == 6008
| where TimeGenerated > ago(24h)
| summarize AggregatedValue= any(EventID) by Computer, bin(TimeGenerated, 1s)
The '1s' in this sample query is the time frame with which we are aggregating and getting the output from the Log Analytics workspace repository. For more information, refer to https://learn.microsoft.com/en-us/azure/kusto/query/summarizeoperator
And to create an alert, you may have to go to Azure portal -> YOURLOGANALYTICSWORKSPACE -> Monitoring tile -> Alerts -> Manage alert rules -> New alert rule -> Add condition -> Custom log search -> paste any of the above queries under the 'Search query' section -> type '2' under the 'Threshold value' parameter of the 'Alert logic' section -> click 'Done' -> under the 'Action Groups' section, select an existing action group or create a new one as explained in the article mentioned below -> update 'Alert Details' -> click 'Create alert rule'.
https://learn.microsoft.com/en-us/azure/azure-monitor/platform/action-groups
Hope this helps!! Cheers!! :)
To answer your question in the comments: yes, the alert insists on adding the bin function, and that's the reason I provided the relevant query with a '1s' bin and tried to explain it in my previous answer.
If you put '1s' in the bin function, you fetch output from Log Analytics aggregated with the 'any' value of EventID over a timespan of 1 second, so the output contains one row per computer per second in which the event occurred (think of aaaaaaa as a VM name and x as a particular time).
If you put '24h' instead of '1s' in the bin function, you fetch output aggregated with the 'any' value of EventID over a timespan of 24 hours, so the output contains at most one row per computer per 24 hours.
So in this case, we should not be using '24h' in the bin function along with the 'any' aggregation, because then we would see only one row of output per 24-hour timespan, and that doesn't help you find the event occurrence count with the query above that uses 'any' for aggregation. Instead, you may use the 'count' aggregation if you want to have '24h' in the bin function. The query would then look something like this:
Event
| where Source == "EventLog" and EventID == 6008
| where TimeGenerated > ago(24h)
| summarize AggregatedValue= count(EventID) by Computer, bin(TimeGenerated, 24h)
The output of this query would have one row per computer per 24-hour bin, with AggregatedValue holding the event count (think of aaaaaaa as a VM name, x as a particular time, and y and z as some numbers).
One other note: all the above queries and outputs are in the context of setting up an alert based on an aggregated value, i.e., choosing 'metric measurement' under the 'Alert logic based on' section. In other words, an AggregatedValue column is expected in the alert query when you choose 'metric measurement'. But when you say 'you get a count of the events', then, if I'm not wrong, maybe you are choosing 'number of results' under the 'Alert logic based on' section, which does not require any aggregation column in the query.
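For comparison, a query for the 'number of results' option can be as simple as the filter itself, since the alert then fires on the row count rather than on an AggregatedValue column (a sketch):
// sketch: 'number of results' alert logic fires on the number of rows returned
Event
| where Source == "EventLog" and EventID == 6008
| where TimeGenerated > ago(24h)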
Hope this clarifies!! Cheers!!
