Configuring Azure Application Insights with Terraform

I would like to configure alert rules in Azure Application Insights. Is this something that can be done using Terraform, or do I have to do it through the portal?
I would like to be alerted on the following:
Whenever the average available memory is less than 200 megabytes (Signal Type = Metrics)
Whenever the average process cpu is greater than 80 (Signal Type = Metrics)
Whenever the average server response time is greater than 5000 milliseconds (Signal Type = Metrics)
Whenever the count of failed requests is greater than 5 (Signal Type = Metrics)
Failure Anomalies - prodstats-masterdata-sandbox (Signal Type = Smart Detector)

You can do this with Terraform, using the azurerm provider resources azurerm_monitor_metric_alert and azurerm_monitor_smart_detector_alert_rule.
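For illustration, here is a minimal sketch covering one of the metric alerts plus the Failure Anomalies smart detector rule. The resource names, the action group, and the email address are placeholders, not taken from the question; the other metric alerts follow the same pattern with different metric_name/threshold values:

```hcl
# Placeholder action group that the alerts will notify.
resource "azurerm_monitor_action_group" "example" {
  name                = "example-actiongroup"
  resource_group_name = azurerm_resource_group.example.name
  short_name          = "exampleag"

  email_receiver {
    name          = "ops"
    email_address = "ops@example.com"
  }
}

# "Average server response time greater than 5000 ms" as a metric alert.
resource "azurerm_monitor_metric_alert" "response_time" {
  name                = "avg-server-response-time"
  resource_group_name = azurerm_resource_group.example.name
  scopes              = [azurerm_application_insights.example.id]
  description         = "Average server response time greater than 5000 ms"

  criteria {
    metric_namespace = "microsoft.insights/components"
    metric_name      = "requests/duration"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 5000
  }

  action {
    action_group_id = azurerm_monitor_action_group.example.id
  }
}

# The Failure Anomalies rule is a smart detector alert rule.
resource "azurerm_monitor_smart_detector_alert_rule" "failure_anomalies" {
  name                = "Failure Anomalies - example"
  resource_group_name = azurerm_resource_group.example.name
  severity            = "Sev3"
  scope_resource_ids  = [azurerm_application_insights.example.id]
  frequency           = "PT1M"
  detector_type       = "FailureAnomaliesDetector"

  action_group {
    ids = [azurerm_monitor_action_group.example.id]
  }
}
```

The remaining metric alerts (available memory, process CPU, failed request count) are additional azurerm_monitor_metric_alert resources with the appropriate metric_name, aggregation, operator, and threshold.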

Related

Name or Service not known - intermittent error in Azure

I have a TimerTrigger which calls my own Azure Function at a relatively high rate, a few times per second. Every call takes just 100 ms; this is not a stress test.
This call to my own endpoint works about 9999 times out of 10000 but just once in a while I get the following error:
System.Net.Http.HttpRequestException: Name or service not known (app.mycustomdomain.com:443)
---> System.Net.Sockets.SocketException (0xFFFDFFFF): Name or service not known
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
I replaced my actual domain with "app.mycustomdomain.com" in the error message above. It is a custom domain set up to point to the Azure Function App using CNAME.
The Function App does not detect any downtime in the Azure Portal and I have Application Insights enabled and do not see any errors. So I assume the issue is somehow on the callers side and the call never actually happens.
What does this error indicate? And how can I alleviate the problem?
For your second question, alleviating the problem: one option would certainly be to build in retries using a library like Polly. At a high level, you create a policy, e.g. for a simple retry:
var myPolicy = Policy
    .Handle<SomeExceptionType>()
    .Retry(3);
This would retry up to 3 times. To use the policy, call the sync or async version of Execute:
await myPolicy.ExecuteAsync(async () =>
{
    // do stuff that might fail, up to three times
});
More complete samples are available. The library also has lots of support for other approaches, e.g. fixed delays, exponential backoff, etc.

ekg-core/GHC RTS : bogus GC stats when running on Google Cloud Run

I have two services deployed on Google cloud infrastructure; Service 1 runs on Compute Engine and Service 2 on Cloud Run and I'd like to log their memory usage via the ekg-core library (https://hackage.haskell.org/package/ekg-core-0.1.1.7/docs/System-Metrics.html).
The logging bracket is similar to this:
mems <- newStore
registerGcMetrics mems
void $ concurrently io (loop mems)
  where
    loop ms = do
      m <- sampleAll ms
      ... (look up the gauges from m and log their values)
      threadDelay dt
      loop ms
I'm very puzzled by this: both rts.gc.current_bytes_used and rts.gc.max_bytes_used gauges return constant 0 in the case of Service 2 (the Cloud Run one), even though I'm using the same sampling/logging functionality and build options for both services. I should add that the concurrent process in concurrently is a web server, and I expect the base memory load to be around 200 KB, not 0B.
This is about where my knowledge ends; could this behaviour be due to the Google Cloud Run hypervisor ("gVisor") implementing certain syscalls in a non-standard way (gVisor syscall guide : https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/) ?
Thank you for any pointers to guides/manuals/computer wisdom.
Details :
Both are built with these options :
-optl-pthread -optc-Os -threaded -rtsopts -with-rtsopts=-N -with-rtsopts=-T
the only difference is that Service 2 has an additional flag -with-rtsopts=-M2G since Cloud Run services must work with 2 GB of memory at most.
The container OS in both cases is Debian 10.4 ("Buster").
Thinking a bit longer about this, the behaviour is perfectly reasonable in the "serverless" model: resources (both CPU and memory) are throttled down to 0 when the service is not processing requests [1], which is exactly what ekg picks up.
Why logs are printed even outside of requests is still a bit of a mystery, though.
[1] https://cloud.google.com/run/docs/reference/container-contract#lifecycle

App Engine Google Cloud Storage - Error 500 when downloading a file

I'm getting an error 500 when I download a JSON file (approx. 2 MB) using the nodejs-storage library. The file gets downloaded without any problem, but once I render the view and pass the file as a parameter, the app crashes with "The server encountered an error and could not complete your request."
file.download(function(err, contents) {
  var messages = JSON.parse(contents);
  res.render('_myview.ejs', {
    "messages": messages
  });
});
I am using the App Engine Standard Environment and have this further error detail:
Exceeded soft memory limit of 256 MB with 282 MB after servicing 11 requests total. Consider setting a larger instance class in app.yaml
Can someone give me hint? Thank you in advance.
500 error messages are quite hard to troubleshoot due to all the possible scenarios that could go wrong with the App Engine instances. A good way to start debugging this type of error with App Engine is to go to Stackdriver Logging, query for the 500 error messages, click the expander arrow, and check for the specific error code. In the specific case of the Exceeded soft memory limit... error message in the App Engine Standard environment, my suggestion would be to choose an instance class better suited to your application's load.
Assuming you are using automatic scaling, you could try an F2 instance class (which has higher memory and CPU limits than the default F1) and iterate from there. Adding or modifying the instance_class element of your app.yaml file to instance_class: F2 is enough to make the change.
Notice that increasing the instance class directly affects your billing; you can use the Google Cloud Platform Pricing Calculator to get an estimate of the costs associated with using a different instance class for your App Engine application.
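For illustration, the relevant app.yaml fragment might look like this; the runtime value is a placeholder for whatever your app already declares, and only the instance_class line is the suggested change:

```yaml
# app.yaml (fragment) - bump the instance class for more memory/CPU
runtime: nodejs10
instance_class: F2
```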

Function app restarts every hour + 4 minutes

I have a v2 function app written in C# that is deployed to Azure, with Application Insights monitoring set up. I'm looking at the logs to try to diagnose some performance issues, and I'm noticing a bunch of messages like this:
Host started (xyz ms)
I see one of these messages every hour + 4 minutes.
7/9/2019, 8:27:04 AM - TRACE
7/9/2019, 7:23:03 AM - TRACE
7/9/2019, 6:19:02 AM - TRACE
7/9/2019, 5:15:03 AM - TRACE
etc.
I have a function that runs on a timer trigger which I'm using to keep the function app alive, to avoid the cold starts that make calls really slow when the app first starts.
[FunctionName("KeepAlive")]
public void Run([TimerTrigger("30 */4 * * * *", RunOnStartup = true)] TimerInfo myTimer, ILogger log)
{
    log.LogInformation("Keep Alive");
}
I thought that with this function running every 4 minutes it would prevent my function app from shutting down, but for some reason it is restarting every hour + four minutes. What am I doing wrong?
From the back-end logs of 9th and 10th July, there were no restarts.
All these functions and rest of the function executed successfully without a single failure.
Sta*****Function
Mo*****st
Physical*******List
We have detected that you are running with the default setting of log sampling enabled for Application Insights. This could cause execution logs to be missing from your monitor logs.
Enabling Application Insights log sampling might lead to:
Timer trigger executions missing from your monitor logs
Other log data missing
You may just need to adjust the sampling settings to fit your particular monitoring scenario.
Please review this guidance to configure sampling.
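For reference, a minimal sketch of a v2 function app's host.json that turns sampling off entirely (one way to make sure no executions are dropped from the logs, at the cost of a higher telemetry volume):

```json
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": false
      }
    }
  }
}
```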
Also runOnStartup is enabled. We recommend against setting runOnStartup to true in production.
The function will be invoked when the runtime starts. This might lead to unscheduled executions in the execution list below.
Please check here to disable runOnStartup configuration.

Service Bus Topic/Queue Monitoring for size getting closer to its limit

We have quite a few Azure Service Bus topics/queues in production. Any given topic has a max size, and it's possible to hit that limit for various reasons not related to load, e.g. unused subscriptions attached to the topic.
We have had more than one outage when a topic hit its size limit because of unused subscriptions.
We are looking for fundamental monitoring where:
1. If the size of a topic exceeds X% of its max size, we get an email/notification.
2. Any topic in the production namespace is automatically added to monitoring. It's possible for a dev to forget to add the monitoring when adding a new topic to the namespace.
While 2. is good to have, having just 1. would also be fine.
Azure Service Bus has "Metrics" in preview currently, and there are a bunch of metrics we can set up to get alerted on. It looks like it is in very early stages, and even the above requirement cannot be fulfilled.
Am I missing something, or do I need to build custom monitoring using Azure Functions / Logic Apps by invoking the REST APIs exposed at https://learn.microsoft.com/en-us/azure/monitoring-and-diagnostics/monitoring-supported-metrics?redirectedfrom=MSDN#microsoftservicebusnamespaces ?
https://www.servicebus360.com/ is selling the above functionality but my requirement is very rudimentary.
The size of a queue/topic is now available in the Azure Monitor metrics. As the feature is in preview, the values may not be reflected instantaneously, but it is possible to monitor these metrics using Azure Monitor (which is also currently in preview).
Yes, it is possible to get usage details about Azure Service Bus Queues space usage.
Find below a sample Console Application (C# + .NET Framework 4.7 + WindowsAzure.ServiceBus 4.1.10) that calculates the free space in a given queue. Use TopicDescription for topics.
private static async Task GetFreeSpace(string connectionString, string queueName)
{
    if (string.IsNullOrWhiteSpace(connectionString))
    {
        throw new ArgumentException("Service bus connection string cannot be null, empty or whitespace.");
    }

    if (string.IsNullOrWhiteSpace(queueName))
    {
        throw new ArgumentException("Service bus queue name cannot be null, empty or whitespace.");
    }

    NamespaceManager nm = NamespaceManager.CreateFromConnectionString(connectionString);
    QueueDescription queueDescription = await nm.GetQueueAsync(queueName);

    double spaceUsedInMB = 0;
    double freeSpaceInMB = 0;
    double percentageFreeSpace = 100;

    if (queueDescription.SizeInBytes > 0)
    {
        spaceUsedInMB = queueDescription.SizeInBytes / 1024.0 / 1024.0;
        freeSpaceInMB = queueDescription.MaxSizeInMegabytes - spaceUsedInMB;
        percentageFreeSpace = 100 * freeSpaceInMB / queueDescription.MaxSizeInMegabytes;
    }

    Console.WriteLine($"Max Size (MB) = {queueDescription.MaxSizeInMegabytes:0.00000}");
    Console.WriteLine($"Used Space (MB) = {spaceUsedInMB:0.00000}");
    Console.WriteLine($"Free Space (MB) = {freeSpaceInMB:0.00000}");
    Console.WriteLine($"Free Space (%) = {percentageFreeSpace:0.00000}");
}
Here is the packages.config file content:
<?xml version="1.0" encoding="utf-8"?>
<packages>
  <package id="WindowsAzure.ServiceBus" version="4.1.10" targetFramework="net47" />
</packages>
This can be automated using a Timer as long as it meets your requirements. Find more details at https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-timer.
In addition, per documentation https://learn.microsoft.com/en-us/powershell/module/azurerm.servicebus/get-azurermservicebusqueue?view=azurermps-6.1.0 it is also possible to get these details using PowerShell.
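As a sketch of how the same free-space calculation could be scripted outside .NET, here is a shell version. The az command in the comment is only indicative (resource group, namespace, and queue names are placeholders), and the byte value is hard-coded for illustration:

```shell
# In practice, fetch the current size with the Azure CLI, e.g.:
#   size_bytes=$(az servicebus queue show --resource-group <rg> \
#     --namespace-name <ns> --name <queue> --query sizeInBytes -o tsv)
size_bytes=52428800   # example value: 50 MB currently used
max_mb=1024           # the queue's MaxSizeInMegabytes

# Same arithmetic as the C# sample above, done with awk.
used_mb=$(awk "BEGIN { printf \"%.2f\", $size_bytes / 1024 / 1024 }")
free_mb=$(awk "BEGIN { printf \"%.2f\", $max_mb - $used_mb }")
pct=$(awk "BEGIN { printf \"%.2f\", 100 * ($max_mb - $used_mb) / $max_mb }")

echo "Used Space (MB) = $used_mb"
echo "Free Space (MB) = $free_mb"
echo "Free Space (%)  = $pct"
```

Wrapped in a cron job or a timer-triggered function, this could send the notification whenever pct drops below the chosen threshold.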
