Analysis, anomaly detection, and alerting on logs - ELK - Logstash

We have logs from multiple applications and VMs streamed to an ELK (Elasticsearch, Logstash, Kibana) stack. This runs in our data center, not in the cloud. I am looking for advice on:
a) Any mechanism or service that runs on top of this ELK stack and understands the application's behavior
b) If there is an anomaly, generate alerts
This system/service needs to understand the application (machine learning capabilities). By "understand the application" I mean that the service should learn, for example, that my application has high traffic on Wednesdays and low traffic on Fridays. If I then see very high traffic on a Friday, that is an anomaly.
Another example: if my application has been throwing 20 exceptions a day and I am now seeing 50 exceptions a day after the latest version was deployed, that is an anomaly.
c) If there is an anomaly, send the alert via PagerDuty or email to all the stakeholders.
This could be a paid service or something that can be plugged into the ELK installation.
I do not want to ship the logs to the cloud to get this service.

Machine learning, anomaly detection, and alerting are all features of Elastic's X-Pack.
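To make the "high on Wednesday, low on Friday" idea from the question concrete, here is a minimal Python sketch of a per-weekday baseline with a z-score check. It is not how X-Pack's machine learning works internally; the function names, the 3-sigma threshold, and the input shape (daily counts already aggregated from Elasticsearch) are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev


def weekday_baseline(history):
    """Build {weekday: (mean, stdev)} from an iterable of (date, count) pairs.

    Weekdays with fewer than two samples are skipped because a standard
    deviation cannot be computed for them.
    """
    buckets = defaultdict(list)
    for day, count in history:
        buckets[day.weekday()].append(count)
    return {wd: (mean(v), stdev(v)) for wd, v in buckets.items() if len(v) >= 2}


def is_anomaly(baseline, day, count, threshold=3.0):
    """Return True if `count` is more than `threshold` standard deviations
    away from the mean observed for that day's weekday (hypothetical rule)."""
    if day.weekday() not in baseline:
        return False  # no history for this weekday, so nothing to compare against
    mu, sigma = baseline[day.weekday()]
    if sigma == 0:
        return count != mu
    return abs(count - mu) / sigma > threshold
```

With a few weeks of history, a Friday count that would be normal for a Wednesday is flagged, while the same count on a Wednesday is not. The same check applies to the exceptions-per-day example if you feed it exception counts instead of request counts.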

Related

Emitting application level metrics in node js

I want to emit metrics from my Node application to monitor how frequently a certain branch of code is reached. For example, I am interested in knowing how many times a service call didn't return the expected response. I also want to emit, for each service call, how long it took.
I expect to use a client in the code that emits metrics to a server, and then view the metrics in a dashboard on that server. I am mainly interested in open source solutions that I can host on my own infrastructure.
Please note, I am not interested in system metrics here such as CPU, memory usage etc.
Implement pervasive logging and then use something like Elasticsearch + Kibana to display them in a dashboard.
There are other metric dashboard systems such as Grafana, Graphite, Tableau etc. A lot of them work with metrics, which are numbers associated with tags, such as function call counts, CPU load etc. The main reason I like the Kibana solution is that it is not based on metrics but instead extracts metrics from your log files.
The only thing you really need to do with your code is make sure your logs are timestamped.
Google for Kibana or "ELK stack" (ELK stands for Elasticsearch + Logstash + Kibana) for how to set this up. The first time I set it up, it took me just a few hours to get results.
Node has several loggers that can be configured to send log events to ELK. In addition, the Logstash (or the more modern "Beats") part of ELK can ingest any log file and parse it with regular expressions to forward data to Elasticsearch, so you do not need to modify your software.
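As a rough illustration of what "extracting metrics from logs" means, here is a small Python sketch that does by hand what Logstash and Elasticsearch would do for you: parse timestamped lines with a regular expression and count matching events per minute. The log line format, the regex, and the function names are assumptions made up for this example, not anything Logstash requires.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical line format:
#   2024-01-05T10:15:32Z ERROR service call returned unexpected response
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def parse_line(line):
    """Return a dict with timestamp, level and message, or None if the line
    does not match the assumed format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    try:
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None  # first field was not a timestamp in the assumed format
    return {"timestamp": ts, "level": match.group("level"), "message": match.group("msg")}


def count_per_minute(lines, level="ERROR"):
    """Count events of the given level, bucketed by minute."""
    counts = Counter()
    for line in lines:
        event = parse_line(line)
        if event is not None and event["level"] == level:
            counts[event["timestamp"].replace(second=0)] += 1
    return counts
```

The point is that the application only has to write timestamped lines. The counts (the "metrics") are derived afterwards, and the original lines stay available for drilling down.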
The ELK solution can be configured simply or you can spend literally weeks tuning your data parsing and graphs to get more insights - it is very flexible and how you use it is up to you.
Metrics vs Logs (opinion):
What you want is of course the metrics. But metrics alone don't say much. What you are ultimately after is being able to analyse your system for debugging and optimisation. This is where logging has an advantage.
With a solution that extracts metrics from logs like Kibana you have another layer to deep-dive into behind the metrics. You can query it to find what events caused the metrics. This is not easy to do on a running system because you would normally have to simulate inputs to your system to get similar metrics to figure out what is happening. But with Kibana you can analyse historical events that already happened instead!
Here's an old screenshot of a Kibana set-up I did a few years back to monitor a web service (including all emails it receives):
Note the screenshot above - apart from the graphs and metrics I extract from my system I also display parsed logs at the bottom of the dashboard so I get near real-time view of what is happening. This is the email received dashboard which we used to monitor things like subscriptions, complaints, click-through rates etc.

Horizontal/Vertical scaling of self hosted integration runtime

We're looking for an automated way to horizontally and vertically scale the pool of self-hosted integration runtime virtual machines used in ADF.
The Microsoft docs do not provide an answer.
Well, I don't have hands-on experience with this, so I can only give you a theoretical answer, but maybe it's helpful for you.
AFAIK, neither way is configurable out of the box. For scale-out you'll have to deploy an additional IR machine yourself. So you'll probably want to create an image that you can provision from Docker or Kubernetes and that has the IR and its prerequisites installed. The IR installation provides a PowerShell script that can be used to automate the connection.
For scale-up/down, you'll have to run some script that resizes your VM. In an IaaS solution (for example, an Azure VM), that should be doable with an API call to change your VM.
In both cases you'll need some kind of monitor in place that watches the IR load and makes changes as needed. I think the metrics provided in Data Factory should do. Maybe you can use Log Analytics to monitor the load.
I'm curious about your use case for this.
My solution is just for scaling out/in since the VM must be restarted if you are scaling up/down, which causes downtime and job failures etc.
At a high level this solution requires just 3 simple things:
Azure Metric Alert that fires when Scale-Out should occur (VM Start)
Azure Metric Alert that fires when Scale-In should occur (VM Deallocation)
Logic App that is triggered by the Azure Alert and actually executes the Start/Stop of the VM, along with any other automation associated with this (e.g. posting to a Teams channel when Scale in/out occurs)
Here are more details on how we set up the conditions for the alerts. The main metrics to keep in mind are IR CPU %, IR queue length, number of nodes, and possibly IR memory.
Scale-Out
Scale-In
Actions for Alerts
As you can see below, we have the alerts triggering a single Logic App. Using the payload that is passed to the Logic App, you can determine whether the Logic App should start or stop the VM (as well as perform any additional actions).
Logic App
There is a small chance that, due to timing (and depending on how many ADFs the IR is shared with), pipeline activities could be sent to Node 2 at the same time a deallocation command is sent to the VM for Node 2. I have not seen this yet, but adjusting the alert conditions to your needs could help avoid it. Feel free to play around with the conditions of the alerts, granularity, thresholds, etc. This is not a one-size-fits-all solution.
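The decision that the two alerts and the Logic App make together can be sketched in a few lines of Python. This only illustrates the logic; it does not call any Azure API, and every threshold and name here (`IrMetrics`, `scaling_action`, 80% / 20% CPU, a queue length of 10) is a made-up assumption that you would tune to your own workload.

```python
from dataclasses import dataclass


@dataclass
class IrMetrics:
    """Snapshot of the integration runtime metrics the alerts look at."""
    cpu_percent: float   # IR CPU %
    queue_length: int    # IR queue length
    running_nodes: int   # number of IR nodes currently started


def scaling_action(m, min_nodes=1, max_nodes=2,
                   cpu_high=80.0, cpu_low=20.0, queue_high=10):
    """Return "start", "deallocate" or "none" (all thresholds are hypothetical)."""
    # Scale out: the IR is busy and there is still a stopped node to start.
    if m.running_nodes < max_nodes and (
        m.cpu_percent >= cpu_high or m.queue_length >= queue_high
    ):
        return "start"
    # Scale in: the IR is idle. Requiring an empty queue narrows (but does not
    # remove) the window in which work lands on a node being deallocated.
    if m.running_nodes > min_nodes and m.cpu_percent <= cpu_low and m.queue_length == 0:
        return "deallocate"
    return "none"
```

For example, `scaling_action(IrMetrics(cpu_percent=90, queue_length=0, running_nodes=1))` returns `"start"`, and the same metrics with both nodes already running return `"none"`.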

AspNet Core on Azure App Service peaks to 100% CPU and app gateway load balancer not working

We have a few of our internal business services hosted on an isolated ASE in Azure.
These services run on a medium app service plan with 2 instances.
This environment has been in production use for a little more than a month now and has been performing fairly well, apart from an occasional sudden CPU spike to 100% on one of the instances, which brings down the services.
We don't have autoscaling set up but have 2 instances running all the time.
The services are `aspnetcore` webapi and the runtime is dotnet core 2.0.
Every time I have come across this issue in the last couple of weeks, I have not been lucky enough to log in to Kudu and get a process dump to investigate further. The business is literally on my back to get the service up and running as quickly as possible, and the easiest route is to restart the faulting service or swap slots with a pre-prod environment.
Access to the ASE is also restricted from our network, which makes it all the more difficult: I have to switch to a WiFi network and then go through jump boxes to log in to Kudu. I asked our Ops engineer to get me the dump when this issue is reported, but he has not managed to either, mostly for the same reasons I am not able to do it myself.
All exceptions I can see in Application Insights are due to the services themselves going down, and there are no exceptions there that could have caused the issue in the first place (at least I've not found any yet).
This led me to take a few guesses and look at metrics, and the only thing raising my suspicions is garbage collection. I don't see any sudden spike in the GC graphs either. Each time the service is restarted the graph is a fairly straight line (over 24 hours), but it increases day by day and ends up like below.
But the working memory is a sinusoidal graph, which makes me think there are no memory leaks. Is the above graph over 3 days normal, though?
The drop is when I restart the service. But all services have a similar trajectory even the one that has not gone down.
I am not sure if this is a problem with an individual service or an environment configuration I have overlooked.
The API endpoints are simple CRUD operations and publish events to a service bus topic after each operation. There is a static `HttpClient` instance used to fetch data from another service. Apart from that there are no unmanaged resources and the DB connections are always wrapped in `using` statements.
I understand I would need a process dump to investigate further, but my biggest concern is why the application gateway (load balancer) is not sending traffic to the healthy instance. Because the gateway goes unhealthy, Cloudflare returns a `502` response to clients using the API.
MS support hasn't been able to help and has not answered whether our load balancers are working correctly.
The average number of requests is about 50-60 per minute.
CPU runs at less than 10% apart from these sudden surges.
Thanks
It could be that the backend is pegged at 100% CPU and is unable to respond to Application Gateway health probes. When such an issue occurs, were you able to verify, using Backend health logs, the health state of your backends? If both backend instances were unhealthy, it would explain the 502s. If one of them was healthy and responding to probes, then new requests sent to Application Gateway would indeed flow to the healthy instance. If you suspect that is not the case then please reply back with subscription id, gateway name and approximate time window of incident for us to take a look.

Website going in and out, no activities shown for past 3 days on Azure Web App

Our website (hosted on Azure) has been going in and out the whole morning: it works for 5 minutes, then stops loading, then comes back again. Here's the error message I receive. I've tried restarting the site from Azure Web App a few times and the problem persists.
I've also tracked activities on the Azure dashboard and there is nothing recorded for the last 3 days.
http://i60.tinypic.com/mie6v5.png
Please let me know how to fix this issue, thanks.
P.S.: We have a Standard subscription, and I'm thinking this might be due to the Service Bus - West US and Australia Southeast - Partial Service Interruption as reported on http://azure.microsoft.com/en-us/status/#current
This may be because you are using the Free tier, which limits CPU to 5 min per hour, or Shared, which is slightly higher. To keep your site up and responsive, enable Always On in the site's configure tab. You will need to scale your site to the Basic or Standard tier first.
I would also recommend using End Point Monitoring to verify your site is actively handling requests. You can set this in the configure tab as well.
Lastly, I recommend you use the /support tool in Kudu, as it provides a richer set of graphs to monitor site activity. To access it, type this URL in your browser: https://mysite.scm.azurewebsites.net/support where mysite is the name of your site. Best of all, this tool can analyze your site as well and help you troubleshoot issues.
Hope that is helpful.

Sudden dropoff in Azure queue performance

Short version: What reasons could there be for a sudden, dramatic, and seemingly permanent increase in the rate of timing-out Azure queue requests?
It's going to be difficult to provide all of the details that could possibly be relevant here, but here's a start:
This is an Azure application (SDK v2.0) with a WCF service placing work requests on a queue (roughly 100k calls a day) and a couple of worker roles which process the queue. We've got New Relic monitoring with the latest .NET agent (3.3.38).
We've run into an issue in our latest release, deployed a few days ago. After it ran normally for about 24 hours, all of a sudden we started seeing a greatly increased rate of timeouts when our worker roles fetch messages from the queue, along with a catastrophic drop in throughput (our application can now barely keep up with its own queue using 40 workers, whereas it usually gets by with just 2!). Ever since the timeouts started, they have continued at the same rate and show no signs of letting up.
A couple images from New Relic to illustrate:
While this isn't nearly enough information to provide a good answer, I'm just trying to figure out where I might start looking. I've got support tickets open with New Relic and Microsoft, but we're trying to investigate on our own as well. Could this be throttling? Some kind of resource exhaustion in my queue processor worker role? We don't see increased load on the WCF service, and we haven't changed Azure client libraries or changed much of anything in the code that processes the queue.
I suggest you enable analytics on your storage account to determine whether the bottleneck is server side or client side/network related. Specifically, you can compare the AverageE2ELatency and AverageServerLatency properties in the Storage Analytics metrics tables: high server latency points to the server side, while high end-to-end latency with low server latency points to the client or the network.
You can learn more about Azure storage analytics from the links below:
Overview:
http://msdn.microsoft.com/en-us/library/hh343270.aspx
How to enable in portal:
http://azure.microsoft.com/en-us/documentation/articles/storage-monitor-storage-account/
Metrics table Schema:
http://msdn.microsoft.com/en-us/library/hh343264.aspx
Blog post:
http://blogs.msdn.com/b/windowsazurestorage/archive/2011/08/03/windows-azure-storage-analytics.aspx
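Once you have the two latency values from the metrics table, the comparison described above can be written down as a simple rule. This is only a sketch in Python: `AverageE2ELatency` and `AverageServerLatency` are the real metric names, but the function, the 100 ms "slow" cutoff, and the 50% share are arbitrary assumptions you should adjust to your own baseline.

```python
def classify_bottleneck(avg_e2e_ms, avg_server_ms, slow_ms=100.0, server_share=0.5):
    """Classify where queue request time is being spent.

    avg_e2e_ms    -- AverageE2ELatency (server time plus request/response transfer)
    avg_server_ms -- AverageServerLatency (time spent inside the storage service)
    slow_ms       -- hypothetical cutoff above which requests count as slow (must be > 0)
    server_share  -- hypothetical fraction of end-to-end time that marks the
                     server as the dominant contributor
    """
    if avg_e2e_ms < slow_ms:
        return "healthy"
    if avg_server_ms / avg_e2e_ms >= server_share:
        return "server"
    return "client_or_network"
```

If most of the end-to-end time is server time, throttling or a storage-side problem is the more likely suspect. If server latency stays low while end-to-end latency is high, look at the worker roles and the network path instead.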
