There is a set of machines hosted in my company's on-premises network.
A Windows service, let's say ServiceX, runs on each of these hosts.
Each ServiceX instance polls a database at a regular interval.
Before polling the database, I added the following line to emit a metric:
TelemetryClient().GetMetric("AgentHeartBeat").TrackValue(1)
So every time ServiceX polls, I can see a heartbeat custom metric in Azure Application Insights.
To set up the alert, I am using the following query:
customMetrics
| where name == 'AgentHeartBeat'
| summarize AggregatedValue = avg(valueMin) by bin(timestamp, 1min),
cloud_RoleInstance
Note: the cloud_RoleInstance value is Environment.MachineName.
See the figure below for the complete alert configuration.
To test the alert, I turned off one of the machines (hosting ServiceX), but the alert still was not triggered.
I am not sure what exactly I need to change to make it work.
I am trying to add an alert that fires if an Azure ML pipeline fails. It looks like one of the ways is to create a monitor in the Azure Portal. The problem is that I cannot find the correct signal name (required when setting up the condition) that would identify a pipeline failure. What signal name should I use? Or is there another way to send an email if an Azure ML pipeline fails?
What signal name should I use?
You can use the PipelineChangeEvent category of the AmlPipelineEvent table to view events when an ML pipeline draft, endpoint, or module is accessed (read, created, or deleted).
For example, according to the documentation, use AmlComputeJobEvent to get failed jobs in the last five days:
AmlComputeJobEvent
| where TimeGenerated > ago(5d) and EventType == "JobFailed"
| project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType
Updated answer:
According to Laurynas G:
AmlRunStatusChangedEvent
| where Status == "Failed" or Status == "Canceled"
You can refer to "Monitor Azure Machine Learning", "Log & view metrics and log files", and "Troubleshooting machine learning pipelines".
Context: an App Service in Azure with autoscale enabled, 2 to 8 instances. The workload usually fluctuates between 2 and 4 instances, and only on rare occasions does scaling max out at 8 instances; for such cases I want to set up an alert. Let's say I am interested in all scaling operations above an instance count of 4.
Problem: I cannot find an alert "scope + condition signal" combination where the autoscale instance count can be selected. Is such data accessible at all?
And no, I do not want to use the out-of-the-box "Scale out - Notify" functionality, because that feature sends emails about all scaling operations. Since I am interested only in >4 instances and not in 2..4 instances, conditioning must be possible.
You can create an alert for autoscale operations in a web app by sending the autoscale operation logs to a Log Analytics workspace and then creating a custom alert.
Here are the steps you need to follow:
Go to the Scale out (App Service plan) option in the portal and, under it, navigate to Diagnostic settings.
Create a diagnostic setting for the autoscale operations and send those logs to a Log Analytics workspace.
Based on the requirement shared above, use the query below to pull the scale-out operations of a web app where the new instance count is greater than 4.
AutoscaleScaleActionsLog
| where OperationName == 'InstanceScaleAction' and ResultType == "Succeeded"
| where NewInstanceCount > 4 and ScaleDirection == 'Increase'
Use the New alert rule option in the Log Analytics workspace to create a custom alert, using the above query as the signal, as shown in the picture below.
Here is a sample image of the alert rule that was created using the above query.
The above alert query runs every thirty minutes; if any matching autoscale operation was recorded, it triggers an email to the configured recipients.
Click Save and enable the alert rule.
Here is the sample email output triggered by the alert rule.
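To make the filter in the query above easier to reason about, here is a minimal Python sketch of the same condition applied to in-memory records. The `ScaleAction` class and `scale_outs_above` function are hypothetical names used only for illustration; the field names mirror the columns used in the query, and the sample rows are made up.

```python
from dataclasses import dataclass


@dataclass
class ScaleAction:
    # Hypothetical stand-in for one row of AutoscaleScaleActionsLog.
    operation_name: str
    result_type: str
    scale_direction: str
    new_instance_count: int


def scale_outs_above(actions, threshold=4):
    """Keep successful scale-out actions whose new instance count exceeds threshold."""
    return [
        a for a in actions
        if a.operation_name == "InstanceScaleAction"
        and a.result_type == "Succeeded"
        and a.scale_direction == "Increase"
        and a.new_instance_count > threshold
    ]


# Made-up sample rows: only the second one satisfies every condition.
sample = [
    ScaleAction("InstanceScaleAction", "Succeeded", "Increase", 3),
    ScaleAction("InstanceScaleAction", "Succeeded", "Increase", 5),
    ScaleAction("InstanceScaleAction", "Succeeded", "Decrease", 6),
    ScaleAction("InstanceScaleAction", "Failed", "Increase", 8),
]
matches = scale_outs_above(sample)
print([m.new_instance_count for m in matches])
```

A scale-out from 2 to 3 instances is dropped by the count check, so the alert stays quiet for the usual 2..4 range.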
I have a log analytics workspace and 2 VMs connected to it. The VMs do not have Guest-OS enabled.
When I navigate to the Log Analytics --> Log blade and run the Azure-provided query for "% Free Space", nothing shows up at all.
Do I need to enable Guest-OS for the VMs? I thought this metric was provided out of the box by Azure. What am I missing here?
More observations:
VM1 and VM2 are connected to the Log Analytics workspace. I enabled Guest-OS for VM2 only, thinking that this was needed. When I ran the Free Space query with the Log Analytics workspace as the scope, I could see data for VM1 as well, which was strange.
So I concluded that Guest-OS is not needed for this metric.
So I removed Guest-OS and removed WADPerformanceCountersTable from the storage too.
And now I don't see any data for the query.
According to my test, to monitor a server's available disk space with Azure Log Analytics, the Azure Monitor agent must be installed on the VMs you want to monitor, and performance counters must be enabled in the Log Analytics workspace. For further details, please refer to the blog.
For example (I used a Windows VM for the test):
Enable the Log Analytics VM Extension. For more details, please refer to here and here
Configuring Performance counters
Query
Perf
| where ObjectName == "LogicalDisk" or // the object name used in Windows records
ObjectName == "Logical Disk" // the object name used in Linux records
| where CounterName == "Free Megabytes"
| summarize arg_max(TimeGenerated, *) by InstanceName // arg_max over TimeGenerated returns the latest record
| project TimeGenerated, InstanceName, CounterValue, Computer, _ResourceId
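The `summarize arg_max(TimeGenerated, *) by InstanceName` step keeps, for each disk instance, only the row with the latest timestamp. A minimal Python sketch of that idea follows; `latest_per_instance` is a hypothetical helper, the dictionaries stand in for Perf rows, and the sample values are made up.

```python
from datetime import datetime


def latest_per_instance(records):
    """Return the most recent record for each InstanceName (like arg_max by InstanceName)."""
    latest = {}
    for rec in records:
        key = rec["InstanceName"]
        # Replace the stored record only when this one is newer.
        if key not in latest or rec["TimeGenerated"] > latest[key]["TimeGenerated"]:
            latest[key] = rec
    return latest


# Made-up "Free Megabytes" samples for two disks.
rows = [
    {"InstanceName": "C:", "TimeGenerated": datetime(2024, 1, 1, 10, 0), "CounterValue": 1000.0},
    {"InstanceName": "D:", "TimeGenerated": datetime(2024, 1, 1, 10, 1), "CounterValue": 5000.0},
    {"InstanceName": "C:", "TimeGenerated": datetime(2024, 1, 1, 10, 5), "CounterValue": 900.0},
]
result = latest_per_instance(rows)
for name, rec in sorted(result.items()):
    print(name, rec["CounterValue"])
```

Note that grouping by InstanceName alone mixes disks from different machines when several VMs report to the same workspace; grouping by both Computer and InstanceName would keep them apart.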
I use an Azure VM for personal purposes and use it mostly like I would use a laptop, for checking email etc. However, I have several times forgotten to stop the VM when I am done using it, and it has then run idle for days, if not weeks, resulting in unnecessarily high billing.
I want to set up an email (and if possible also SMS and push notification) alert.
I have looked at the alert function in the advisor, but it does not seem to have enough customization to handle such a specific alert (which would also reduce Microsoft's income!).
Do you know any relatively simple way to set up such an alert?
You can make use of a Log Analytics workspace and a custom log search.
The steps below create an alert that fires when the Azure VM has been running for exactly 1 hour.
First:
You need to create a Log Analytics workspace and connect the Azure VM to it, as per this link.
Second:
1. In the Azure portal, navigate to Azure Monitor -> Alerts -> New alert rule.
2. On the "Create rule" page, for Resource, select the Log Analytics workspace you created earlier. Screenshot as below:
Then for Condition, please select Custom log search. Screenshot as below:
Then in the Configure signal logic page, in Search query, input the following query:
Heartbeat
| where Computer == "yangtestvm" //this is your azure vm name
| order by TimeGenerated desc
For Alert logic: set Based on as Number of results, set Operator as Equal to, set Threshold value as 60.
For Evaluated based on: set Period as 60, set Frequency as 5.
The screenshot as below:
Note:
The settings above query the Heartbeat table. A running Azure VM sends one record per minute to the Heartbeat table in Log Analytics. So to check whether the VM has been running for exactly 1 hour (meaning it has sent 60 records to the Heartbeat table), use the above query and set the Threshold value to 60.
The Period also needs to be set to 1 hour (60 minutes), since I am checking whether the VM has been running for 1 hour; for Frequency, you can set any value you like.
If you understand what I explained, you can change these values to suit your needs.
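The alert logic described above boils down to counting Heartbeat records inside the evaluation period and comparing the count with the threshold. Here is a minimal Python sketch of that check, assuming one heartbeat per minute as stated in the note; the function names and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def heartbeats_in_window(timestamps, now, period=timedelta(minutes=60)):
    """Count heartbeats that fall inside the last `period` (the alert's Period)."""
    start = now - period
    return sum(1 for t in timestamps if start < t <= now)


def alert_fires(timestamps, now, threshold=60):
    """Mirror the 'Number of results' / 'Equal to' / threshold alert logic."""
    return heartbeats_in_window(timestamps, now) == threshold


now = datetime(2024, 1, 1, 12, 0)

# VM up for the whole hour: one heartbeat per minute, 60 records in the window.
full_hour = [now - timedelta(minutes=i) for i in range(60)]

# VM started 30 minutes ago: only 30 records in the window.
half_hour = [now - timedelta(minutes=i) for i in range(30)]

print(alert_fires(full_hour, now), alert_fires(half_hour, now))
```

Because the comparison is strict equality, a single missing or extra record in the window keeps the condition from matching, so a "greater than or equal to" operator with a slightly lower threshold may be more forgiving in practice.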
Finally, configure the remaining settings for this alert.
Please let me know if you still have any issues with this.
Another option is to use the Azure Activity log to determine if a VM has been running for more than a specified amount of time. The benefit of this approach is that you don't need to enable diagnostic logging (Log Analytics); it also supports appliances that can't have an agent installed (e.g. NVAs).
The logic behind this query is to determine if the VM is in a running state and, if so, whether it has been running for more than a specified period of time (MaxUptime).
This is achieved by getting the most recent event of type 'Start' or 'Deallocate', then checking if this event is of type 'Start' and was generated more than 'MaxUptime' ago.
let DaysOfLogsToCheck = ago(7days);
let MaxUptime = ago(2h); // If the VM has been up for this long we want to know about it
AzureActivity
| where TimeGenerated > DaysOfLogsToCheck
// ActivityStatus == "Succeeded" makes more sense, but in practice it can be out of order, so "Started" is better in the real world
| where OperationName in ("Deallocate Virtual Machine", "Start Virtual Machine") and ActivityStatus == "Started"
// We need to keep only the most recent entry of type 'Deallocate Virtual Machine' or 'Start Virtual Machine'
| top 1 by TimeGenerated desc
// Check if the most recent entry was "Start Virtual Machine" and is older than MaxUptime
| where OperationName == "Start Virtual Machine" and TimeGenerated <= MaxUptime
| project TimeGenerated, Resource, OperationName, ActivityStatus, ResourceId
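For readers who want to check the reasoning outside of Log Analytics, here is a minimal Python sketch of the same decision: take the most recent 'Start' or 'Deallocate' event and report the VM only if that event is a start older than the allowed uptime. The `running_too_long` function and the sample events are hypothetical; the field names mirror the AzureActivity columns used in the query.

```python
from datetime import datetime, timedelta

START = "Start Virtual Machine"
DEALLOCATE = "Deallocate Virtual Machine"


def running_too_long(events, now, max_uptime=timedelta(hours=2),
                     lookback=timedelta(days=7)):
    """True if the newest Start/Deallocate event is a Start older than max_uptime."""
    relevant = [
        e for e in events
        if e["TimeGenerated"] > now - lookback
        and e["OperationName"] in (START, DEALLOCATE)
        and e["ActivityStatus"] == "Started"
    ]
    if not relevant:
        return False  # nothing in the lookback window, so nothing to report
    latest = max(relevant, key=lambda e: e["TimeGenerated"])  # like `top 1 by TimeGenerated desc`
    return latest["OperationName"] == START and latest["TimeGenerated"] <= now - max_uptime


def event(op, when):
    return {"OperationName": op, "ActivityStatus": "Started", "TimeGenerated": when}


now = datetime(2024, 1, 10, 12, 0)
history = [
    event(START, datetime(2024, 1, 9, 8, 0)),
    event(DEALLOCATE, datetime(2024, 1, 9, 18, 0)),
    event(START, datetime(2024, 1, 10, 8, 0)),  # started 4 hours ago, never stopped
]
still_running = running_too_long(history, now)
stopped = running_too_long(history + [event(DEALLOCATE, datetime(2024, 1, 10, 11, 0))], now)
print(still_running, stopped)
```

Like the query, this looks at a single newest event, so it assumes the events belong to one VM; with several VMs you would apply it per resource.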
I have a Virtual Machine Scale Set (VMSS) with autoscaling rules. I can get the performance metrics of a host, but there is no graph for the instance count.
There is a graph on VMSS settings "Scaling" -> "Run history", like this.
But how can I get it from Metrics and place it on the dashboard?
By default, having a VMSS does not emit anything to Application Insights (AI) unless you configure an app / platform (like Service Fabric for example) to use AI.
So, if you do have software running on the VMSS that emits to AI then you could write an AI analytics query to get the instance count like this:
requests
| summarize dcount(cloud_RoleInstance) by bin(timestamp, 1h)
Typically cloud_RoleInstance contains a VM identifier so that is what I used in the query. It does show the distinct count of VMs.
This only works reliably if the software runs on all VMs in the VMSS and if all VMs emit data to AI at least once an hour. Of course you can adapt the query to your liking / requirements.
Operators used:
dcount: counts the unique occurrences of the specified field (the result is an estimate)
bin: groups results into slots of 1 hour
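To illustrate what the combination of the two operators computes, here is a minimal Python sketch that groups request records into one-hour slots and counts the distinct instances in each slot. `instances_per_hour` is a hypothetical helper and the sample data is made up; unlike dcount, this sketch counts exactly.

```python
from collections import defaultdict
from datetime import datetime


def instances_per_hour(requests):
    """Map each hour slot to the number of distinct instances seen in it."""
    buckets = defaultdict(set)
    for timestamp, instance in requests:
        # Truncate to the start of the hour, like bin(timestamp, 1h).
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].add(instance)
    return {hour: len(instances) for hour, instances in sorted(buckets.items())}


# Made-up (timestamp, cloud_RoleInstance) pairs.
sample_requests = [
    (datetime(2024, 1, 1, 10, 5), "vm-0"),
    (datetime(2024, 1, 1, 10, 20), "vm-1"),
    (datetime(2024, 1, 1, 10, 40), "vm-0"),
    (datetime(2024, 1, 1, 11, 15), "vm-0"),
    (datetime(2024, 1, 1, 11, 30), "vm-1"),
    (datetime(2024, 1, 1, 11, 45), "vm-2"),
]
counts = instances_per_hour(sample_requests)
for hour, count in counts.items():
    print(hour.isoformat(), count)
```

An instance that sends nothing during a slot is simply not counted for that slot, which is why every VM must emit data at least once per hour for the number to be right.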
Thanks Peter Bons, that's what I need!
As I run Docker on the VM, I can add the OMS agent container and use its data.
This is what I wanted.
ContainerInventory
| where TimeGenerated >= ago(3h)
| where Name contains "frontend"
| summarize dcount(Computer) by bin(TimeGenerated, 5m)
In the Azure portal, navigate to VMSS, select the required VMSS -> Scaling under Settings in the left navigation panel -> click the 'Run History' tab in the right-side panel.
The easy way is after you have gone to the 'Run History' tab just click the 'Pin to Dashboard' button. You can see this button in the image supplied in the question.