Our application sends out log4j emails when an Exception is thrown. Because we're doing batch processing, the exception may occur every minute (or more often) until the error is resolved.
Is there a way to set up log4j so that it buffers exceptions and consolidates them, so that we can reduce the number of email alerts that get sent out?
I would also accept a third-party service or Nagios plugin that can do this sort of consolidation.
I am envisioning that even if exceptions are thrown every minute for an hour, we have some tool, service, or other mechanism that can consolidate log4j logs (or any application error log that gets emailed out) so that we have more control over alerts.
The goal is to reduce the noise of alerts going out to the ops folks. They need to know that the alert occurred and is still occurring, but they don't need to be spammed every minute.
Have a look at the new Whisper appender: Whisper on GitHub. It currently works for Logback; there is a plan to include support for log4j.
Statutory disclaimer: I'm the author.
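If you need something in the meantime on plain log4j 1.x, you can also roll a crude throttle yourself. The sketch below is not part of Whisper; the class name and the 15-minute window are my own choices. It uses the standard log4j Filter API to let one ERROR email through per window and drop the rest of the burst; attach it to your SMTPAppender with a <filter> element in log4j.xml.

```java
import org.apache.log4j.Level;
import org.apache.log4j.spi.Filter;
import org.apache.log4j.spi.LoggingEvent;

// Lets at most one ERROR (or higher) event through per window; attach to the
// SMTPAppender so repeated exceptions don't each trigger an email.
public class ErrorEmailThrottleFilter extends Filter {

    // Assumed window: one email per 15 minutes. Tune to taste.
    private static final long WINDOW_MILLIS = 15 * 60 * 1000L;

    private long lastAcceptedAt = 0L;

    @Override
    public synchronized int decide(LoggingEvent event) {
        // Only throttle ERROR and above; other levels pass through untouched.
        if (!event.getLevel().isGreaterOrEqual(Level.ERROR)) {
            return Filter.NEUTRAL;
        }
        long now = System.currentTimeMillis();
        if (now - lastAcceptedAt >= WINDOW_MILLIS) {
            lastAcceptedAt = now;
            return Filter.NEUTRAL; // first error in the window: let the email go out
        }
        return Filter.DENY; // suppress the repeats within the window
    }
}
```

Unlike a proper consolidating appender this simply drops the repeats rather than batching them into a digest, but the first alert still goes out and the inbox stays quiet.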
I want to emit metrics from my Node application to monitor how frequently a certain branch of code is reached. For example, I am interested in knowing how many times a service call didn't return the expected response. I also want to be able to emit, for each service call, the time it took, etc.
I expect to use a client in the code that emits metrics to a server, and then to view the metrics in a dashboard on that server. I am more interested in open source solutions that I can host on my own infrastructure.
Please note, I am not interested in system metrics here such as CPU, memory usage etc.
Implement pervasive logging and then use something like Elasticsearch + Kibana to display the resulting metrics in a dashboard.
There are other metric dashboard systems such as Grafana, Graphite, Tableau, etc. A lot of them work with metrics, i.e. numbers associated with tags, such as counts of function calls or CPU load. The main reason I like the Kibana solution is that it is not based on metrics; instead it extracts metrics from your log files.
The only thing you really need to do with your code is make sure your logs are timestamped.
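As a concrete illustration (no particular logging library assumed; the field names are just examples), writing one JSON object per line with an ISO timestamp is usually all the pipeline needs:

```js
// Emit one JSON log line per event with an ISO-8601 timestamp so that
// Filebeat/Logstash can ship it and Elasticsearch can index it as-is.
// Field names ('event', 'service', 'durationMs') are illustrative, not a schema.
function logEvent(event, fields = {}) {
  console.log(JSON.stringify({
    '@timestamp': new Date().toISOString(),
    event,
    ...fields,
  }));
}

// Example: count how often a service call returned something unexpected,
// and record how long the call took.
logEvent('unexpected_service_response', { service: 'payments-api', durationMs: 412 });
```

With one JSON document per line there is almost no parsing configuration to write, and in Kibana you can aggregate on fields like event or service to get the counts and latency charts you're after.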
Google for Kibana or "ELK stack" (ELK stands for Elasticsearch + Logstash + Kibana) for how to set this up. The first time I set it up, it only took me a few hours to get results.
Node has several loggers that can be configured to send log events to ELK. In addition, the Logstash (or the more modern Beats) part of ELK can ingest any log file, parse it with regular expressions, and forward the data to Elasticsearch, so you do not need to modify your software.
The ELK solution can be configured simply or you can spend literally weeks tuning your data parsing and graphs to get more insights - it is very flexible and how you use it is up to you.
Metrics vs Logs (opinion):
What you want is, of course, the metrics. But metrics alone don't say much. What you are ultimately after is being able to analyse your system for debugging and optimisation. This is where logging has an advantage.
With a solution like Kibana that extracts metrics from logs, you have another layer behind the metrics to deep-dive into. You can query it to find what events caused the metrics. This is not easy to do on a running system, because you would normally have to simulate inputs to your system to reproduce similar metrics and figure out what is happening. With Kibana you can instead analyse historical events that already happened!
Here's an old screenshot of a Kibana set-up I did a few years back to monitor a web service (including all emails it receives):
Note in the screenshot above that, apart from the graphs and metrics I extract from my system, I also display parsed logs at the bottom of the dashboard, so I get a near real-time view of what is happening. This is the email-received dashboard, which we used to monitor things like subscriptions, complaints, click-through rates, etc.
We have a long-running ASP.NET WebApp in Azure which has no real endpoints exposed. It serves a single functional purpose, primarily reading and manipulating database data; effectively it is a batched, scheduled task, triggered by a timer every 30 seconds.
The app runs fine most of the time, but we are seeing occasional issues where the CPU load for the app jumps close to the maximum for the AppServicePlan, instantaneously rather than gradually, and the app stops executing any more timer triggers. We cannot find anything explicit in the executing code to account for it (no signs of deadlocks etc., and all code paths have try/catch, so there should be no unhandled exceptions). More often than not we see errors getting a connection to the database, but it's not clear whether those are cause or symptom.
Note, this is the only resource within the AppService Plan. The Azure SQL database is in the same region and whilst utilised by other apps is very lightly used by them and they also exhibit none of the issues seen by the problem app.
It feels like this is infrastructure related but we have been unable to find anything to explain what is happening so if anyone has any suggestions for where we should be looking they would be gratefully received. We have enabled basic Application Insights (not SDK) but other than seeing CPU load spike prior to loss of app response there is little information of interest given our limited knowledge of how to best utilise Insights.
Based on your description, there are two things I would check to troubleshoot the problem. First, track the running state of your program from the code: write a log entry at the beginning and end of each scheduled batch run to record the status of that run. If possible, also record request and response information along with the start and end times, so you have a complete record of when each run happened and how it went.
Second, write a log entry before the program starts any database operation, recording whether the database connection succeeded. Ideally you would also record which piece of business logic is running when the CPU load spikes, and track the specific operating conditions, so you can analyse precisely what causes the database connection failures.
Because you cannot reproduce the problem, you can only guess at its cause. If the two points above still don't reveal where the problem is, adjust your timer so the program triggers once every 5 minutes instead of every 30 seconds.
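To make the first suggestion concrete, here is a rough sketch of that kind of bracketing log around a timer-triggered run. The class, method and field names are placeholders for your own code; only the logging pattern matters.

```csharp
using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public class BatchRunner
{
    private readonly ILogger<BatchRunner> _logger;
    private readonly string _connectionString;

    public BatchRunner(ILogger<BatchRunner> logger, string connectionString)
    {
        _logger = logger;
        _connectionString = connectionString;
    }

    public async Task RunOnceAsync()
    {
        var runId = Guid.NewGuid();
        var stopwatch = Stopwatch.StartNew();
        _logger.LogInformation("Batch run {RunId} starting", runId);

        try
        {
            using (var connection = new SqlConnection(_connectionString))
            {
                await connection.OpenAsync();
                _logger.LogInformation("Batch run {RunId}: database connection opened", runId);

                await DoBatchWorkAsync(connection);
            }

            _logger.LogInformation("Batch run {RunId} finished in {ElapsedMs} ms",
                runId, stopwatch.ElapsedMilliseconds);
        }
        catch (Exception ex)
        {
            // The point is that the failure and its timing end up in your log sink
            // (Application Insights, files, ...) so you can line them up with the CPU spikes.
            _logger.LogError(ex, "Batch run {RunId} failed after {ElapsedMs} ms",
                runId, stopwatch.ElapsedMilliseconds);
        }
    }

    private Task DoBatchWorkAsync(SqlConnection connection)
    {
        // Placeholder for the app's real database work.
        return Task.CompletedTask;
    }
}
```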
Just checked my Kentico database (Azure hosting) and it ballooned to 21GB. This happened fairly recently since 4 months ago it was just a bit above 1GB.
Checked the tables and my Event Log table has over 2,000,000 entries!!!
Nothing has changed recently, my settings under Settings -> System -> Event Log are still the same:
Event Log Size: 1000
Since globals are also set to 1000, usually I have 2000 or so entries in the event log table.
Does anyone know what happened here, and how to stop it from happening?
If you have online marketing turned on and have a popular site, there will be lots of data in the OM_ tables. Still, 20GB sounds huge. Have lots of asset files been added to the Content Tree, like videos? Also, is the database set to log all transactions? What kinds of log entries are most common: errors or information? Do you have some custom code that could produce lots of log entries?
You can also email Kentico support to get a "check big tables" SQL script which can help you find out which are the large tables.
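If you want a quick look yourself while you wait for that script, a generic size query like the one below (not Kentico's script, just the standard DMVs) can be run from SSMS against an Azure SQL database:

```sql
-- List the 20 largest tables by reserved space (in MB).
SELECT TOP 20
       s.name + '.' + t.name                   AS TableName,
       SUM(ps.reserved_page_count) * 8 / 1024  AS ReservedMB
FROM sys.dm_db_partition_stats ps
JOIN sys.tables t  ON t.object_id = ps.object_id
JOIN sys.schemas s ON s.schema_id = t.schema_id
GROUP BY s.name, t.name
ORDER BY ReservedMB DESC;
```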
You should look into a few other areas as well. Having 2MM event log records won't cause a 20GB jump in DB size from a Kentico perspective as the event logs are pretty minimal data.
Take a look in the analytics, version history, email queue, web farm and scheduled task tables. Also check out the recycle bin. Are you integrating with any other system or inserting/updating a lot of data via the API? If so, this could cause a lot of transactional log files to build up. With Azure SQL I don't know of a way to clean those up.
My suggestion is to check other tables and not just the event log. Maybe query the event log manually via SSMS and see what the top 100 events are; that might help you find the problem. If you need to, you can also clear the log, either through the UI or by manually truncating the table in SSMS.
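For the manual SSMS query, something along these lines shows which sources and event codes dominate the log. The table and column names are from a typical Kentico schema, so verify them against your version before running:

```sql
-- Group the event log by source/code/type to see what is filling it up.
SELECT TOP 100
       Source,
       EventCode,
       EventType,
       COUNT(*)        AS Occurrences,
       MAX(EventTime)  AS LastOccurred
FROM CMS_EventLog
GROUP BY Source, EventCode, EventType
ORDER BY Occurrences DESC;
```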
I've got a series of services that generate events that are written to an Azure Event Hub. This hub is connected to a Stream Analytics job that takes the event information and writes it to Azure Table Storage and a Data Lake Store for later analysis by different teams and tools.
One of my services is reporting all events correctly, but the other isn't. After hooking up a listener to the hub I can see that its events are being sent without a problem, but they aren't being processed by the job or sent to the sinks.
In the audit logs I see periodic transformation errors for one of the columns that's written to the storage, but looking at the data there's no problem with the format, and I can't seem to find a way to look at the problem events that are causing these failures.
The only error I see on the Management Services is
We are experiencing issues writing output for output TSEventStore right now. We will try again soon.
It sounds like there may be two issues:
1) The writing to the TableStorage TSEventStore is failing.
2) There are some data conversion errors.
I would suggest troubleshooting one at a time. For the first one, are any events being written to the TSEventStore at all? Is there another message in the operations logs that may give more detail on why writing is failing?
For the second one, today we don't have a way to output events that have data conversion errors. The best approach is to output the data to only one sink (the data lake) and look at it there.
Thanks,
Kati
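To expand on the single-sink suggestion, here is a hedged sketch of what the job's query could look like. The input/output aliases and column names are placeholders for your own, and TRY_CAST turns a failed conversion into NULL instead of a transformation error:

```sql
-- Route everything to the Data Lake output only, casting defensively so that a
-- malformed value shows up as NULL in the output rather than failing the event.
SELECT
    EventId,
    ServiceName,
    TRY_CAST(EventTime AS datetime) AS EventTime,
    TRY_CAST(MetricValue AS float)  AS MetricValue
INTO
    [datalake-output]
FROM
    [eventhub-input]
```

If you want to isolate the offending events, you could also add a second SELECT into a separate output with a WHERE clause that keeps only the rows where the TRY_CAST result is NULL while the original column is not.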
I've tried to do my homework on this issue but no searches I can make have gotten me closer to the answer. Closest hit was Detect and Delete Orphaned Queues, Topics, or Subscriptions on Azure Service Bus.
My scenario:
I have multiple services running (standard Windows services). At startup these processes subscribe to a given topic in Azure Service Bus. Let's call the topic "Messages".
When the service is shut down, it unsubscribes in a nice way.
But sometimes stuff happens and the service crashes, causing the unsubscription to fail, and the subscription is then left hanging.
My questions:
1) From what I'm seeing, each dead topic subscription still counts when a message is sent to that topic, even if no one is ever going to pick it up. Fact or fiction?
2) Is there any way to remove subscriptions that haven't been checked for a while, for example in the last 24h? Preferably via a PowerShell script?
I've raised this issue directly with Microsoft but haven't received any answer yet. Surely, I can't be the first to experience this. I'll also update this if I get any third party info.
Thanks
Johan
In the Azure SDK 2.0 release we have addressed this scenario with the AutoDeleteOnIdle feature. This allows you to set a timespan on a Queue/Topic/Subscription, and when no activity is detected for the specified duration, the entity is automatically deleted. See details here, and the property to set is here.
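For reference, a minimal sketch of setting this with the NamespaceManager client is below; the connection string, topic and subscription names are placeholders:

```csharp
using System;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

class CreateSelfCleaningSubscription
{
    static void Main()
    {
        var connectionString = "Endpoint=sb://..."; // placeholder
        var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

        var description = new SubscriptionDescription("Messages", "worker-01")
        {
            // Service Bus deletes the subscription itself after 24h without activity.
            AutoDeleteOnIdle = TimeSpan.FromHours(24)
        };

        if (!namespaceManager.SubscriptionExists("Messages", "worker-01"))
        {
            namespaceManager.CreateSubscription(description);
        }
    }
}
```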
On your question 1), yes: messages sent to a topic will be delivered to every matching subscription, even one that is idle (by your own definition). A subscription is a permanent artifact that you create and that is open to receive messages, even when no services are dequeuing messages.
To clean out subscriptions, you can use the AccessedAt property of the SubscriptionDescription to check when someone last read from the subscription (via a Receive operation).
http://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.subscriptiondescription.accessedat.aspx
If you use that logic, you can build your own 'cleansing' mechanism.
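For example, a rough version of that cleansing job (again with placeholder names, deleting anything on the "Messages" topic that has not been accessed for 24 hours) could look like this; it could run from a scheduled console app or WebJob, or be wrapped in PowerShell:

```csharp
using System;
using Microsoft.ServiceBus;

class CleanIdleSubscriptions
{
    static void Main()
    {
        var connectionString = "Endpoint=sb://..."; // placeholder
        var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

        // Delete subscriptions on the "Messages" topic not accessed in the last 24 hours.
        var cutoff = DateTime.UtcNow.AddHours(-24);

        foreach (var subscription in namespaceManager.GetSubscriptions("Messages"))
        {
            if (subscription.AccessedAt < cutoff)
            {
                Console.WriteLine("Deleting idle subscription " + subscription.Name);
                namespaceManager.DeleteSubscription("Messages", subscription.Name);
            }
        }
    }
}
```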
HTH