Emitting application-level metrics in Node.js

I want to emit metrics from my Node.js application to monitor how frequently certain branches of code are reached. For example, I am interested in knowing how many times a service call didn't return the expected response. I also want to be able to emit, for each service call, the time it took, and so on.
I expect to use a client in the code that emits metrics to a server, and then view the metrics in a dashboard on that server. I am mainly interested in open-source solutions that I can host on my own infrastructure.
Please note, I am not interested in system metrics here, such as CPU or memory usage.

Implement pervasive logging and then use something like Elasticsearch + Kibana to display the extracted metrics in a dashboard.
There are other metric-dashboard systems such as Grafana, Graphite, Tableau, etc. Most of them ingest metrics you send them: numbers associated with tags, such as function-call counts, CPU load, etc. The main reason I like the Kibana solution is that it is not fed metrics directly but instead extracts metrics from your log files.
The only thing you really need to do with your code is make sure your logs are timestamped.
Google for Kibana or the "ELK stack" (ELK stands for Elasticsearch + Logstash + Kibana) to see how to set this up. The first time I set it up, it took me just a few hours to get results.
Node has several loggers that can be configured to send log events to ELK. In addition, the Logstash (or the more modern "Beats") part of ELK can ingest any log file and parse it with regexps to forward data to Elasticsearch, so you do not need to modify your software.
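As a minimal sketch of what that could look like on the application side (assuming the pino logger, installed with npm install pino; any structured logger that writes timestamped JSON lines works just as well, and the service names here are made up):

    const pino = require('pino');

    // Write newline-delimited, timestamped JSON to a file that Filebeat/Logstash can tail.
    const logger = pino(pino.destination('./app.log'));

    async function timedCall(serviceName, fn) {
      const started = Date.now();
      try {
        const result = await fn();
        logger.info(
          { service: serviceName, durationMs: Date.now() - started, outcome: 'ok' },
          'service call completed'
        );
        return result;
      } catch (err) {
        // This is the branch worth counting: unexpected responses / failures.
        logger.warn(
          { service: serviceName, durationMs: Date.now() - started, outcome: 'unexpected' },
          'service call did not return the expected response'
        );
        throw err;
      }
    }

    // Example usage (callPaymentService is a placeholder for your own client code):
    // timedCall('payment-service', () => callPaymentService(order));

In Kibana you can then count log lines with outcome:unexpected per service, or plot durationMs over time, without adding a separate metrics client.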
The ELK solution can be configured simply or you can spend literally weeks tuning your data parsing and graphs to get more insights - it is very flexible and how you use it is up to you.
Metrics vs Logs (opinion):
What you want is of course the metrics. But metrics alone don't say much. What you are ultimately after is being able to analyse your system for debugging and optimisation. This is where logging has an advantage.
With a solution that extracts metrics from logs like Kibana you have another layer to deep-dive into behind the metrics. You can query it to find what events caused the metrics. This is not easy to do on a running system because you would normally have to simulate inputs to your system to get similar metrics to figure out what is happening. But with Kibana you can analyse historical events that already happened instead!
In an old Kibana set-up I did a few years back to monitor a web service (including all emails it received), apart from the graphs and metrics extracted from the system, I also displayed the parsed logs at the bottom of the dashboard, so I got a near-real-time view of what was happening. That was the "email received" dashboard, which we used to monitor things like subscriptions, complaints, click-through rates, etc.

Related

Is it good to create a Spark batch job for every new use case?

I run 100s of computers in a network, and 100s of users access those machines. Every day, thousands or more syslogs are generated from all those machines. A syslog could be any log, including system failures, network, firewall, application errors, etc.
A sample log would look like the below:
May 11 11:32:40 scrooge SG_child[1829]: [ID 748625 user.info] m:WR-SG-BLOCK-111-
00 c:Y th:BLOCK , no allow rule matched for request with entryurl:http:url on
mapping:bali [ rid:T6zcuH8AAAEAAGxyAqYAAAAQ sid:a6bbd3447766384f3bccc3ca31dbd50n ip:192.24.61.1]
From the logs, I extract fields like Timestamp, loghost, msg, process, facility, etc., and store them in HDFS. Logs are stored in JSON format. Now I want to build a system where I can type a query in a web application and do analysis on those logs. I would like to be able to do queries like:
get logs where the message contains "Firewall blocked" keywords.
get logs generated for the User Jason
get logs containing "Access denied" msg.
get log count grouped by user, process, loghost etc.
There could be thousands of different types of analytics I want to do. On top of that, I want the combined results of historical data and real-time data, i.e. combining batch and real-time results.
Now my question is:
To get the batch results, I need to run batch Spark jobs. Should I be creating a batch job for every unique query a user makes? If I do so, I will end up creating thousands of batch jobs. If not, what kind of batch jobs should I run so that I can get results for any type of analytics?
Am I thinking about this the right way? If my approach itself is wrong, then do share what the correct procedure should be.
While it's possible (via the Thrift server, for example), Apache Spark's main objective is not to be a query engine but to build data pipelines for streaming and batch data sources.
If your transformation is only projecting fields and you want to enable ad-hoc queries, it sounds like you need another data store, such as Elasticsearch. The additional benefit is that it comes with Kibana, which enables analytics to some extent.
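For a rough sense of how the ad-hoc queries from the question map onto Elasticsearch, here is a sketch using the official Node.js client (@elastic/elasticsearch, v8-style API); the index name and field names (msg, user, process, loghost) are assumptions based on the fields listed above:

    const { Client } = require('@elastic/elasticsearch');
    const client = new Client({ node: 'http://localhost:9200' });

    async function examples() {
      // "get logs where the message contains 'Firewall blocked'"
      const blocked = await client.search({
        index: 'syslogs',
        query: { match_phrase: { msg: 'Firewall blocked' } },
      });

      // "get log count grouped by user, process, loghost etc."
      const counts = await client.search({
        index: 'syslogs',
        size: 0,
        aggs: {
          by_user:    { terms: { field: 'user.keyword' } },
          by_process: { terms: { field: 'process.keyword' } },
          by_loghost: { terms: { field: 'loghost.keyword' } },
        },
      });

      console.log(blocked.hits.total, counts.aggregations);
    }

    examples().catch(console.error);

Each of the question's query types becomes a single search request rather than a dedicated batch job, which is the main point of keeping a queryable copy of the data in Elasticsearch.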
Another option is to use a SQL engine such as Apache Drill.
Spark is probably not the right tool to use unless the size of these logs justifies the choice.
Are these logs in the order of a few gigabytes? Then use Splunk.
Are these logs in the order of hundreds of gigabytes? Then use Elasticsearch, maybe with Kibana on top.
Are they in the order of terabytes? Then you should think about a more powerful analytical architecture; there are many alternatives here that basically run batch jobs in the same way you would with Spark, but usually in a smarter way.

How to take input in logstash?

When should I use Filebeat, Packetbeat or Topbeat?
I am new to the ELK stack. I may sound silly, but I am really confused about these. I would appreciate any sort of help.
It took me a while, but I have figured it out.
Filebeat is used to read input from files. We can use it when some application writes its logs to a file; for example, Elasticsearch's own logs are written to a log file, so we can use Filebeat to read data from log files.
Topbeat is used to visualise CPU usage, RAM usage and other things related to system resources.
Packetbeat can be used to analyze network traffic, and we can directly log the transactions taking place on the ports where they happen.
While I was wondering about the difference between Logstash and the Beats platform, it turned out that Beats are more lightweight: you don't need to install a JVM on each of your servers, as you do for Logstash. However, Logstash has a rich ecosystem of plugins, with their count exceeding 200, while Beats is still under development, so Logstash can be used if we don't have the required protocol support in Beats.
These are all Elasticsearch data shippers belonging to Elastic's Beats family. Each beat helps you analyze different bits and pieces in your environment.
Referring specifically to the beats you mentioned:
Filebeat is good for tracking and forwarding specific log files (e.g. apache access log)
Packetbeat is good for network analysis, monitoring the actual data packets being transferred across the wire
Topbeat can be used for infrastructure monitoring, giving you perf metrics on CPU usage, memory, etc.
There are plenty of resources to help you get started. Try Elastic's site. I also saw a series of tutorials on the Logz.io blog.

is there a recommended way of feeding logstash with azure operation logs?

I need to collect Azure operation logs to feed my ELK (Elasticsearch, Logstash and Kibana) cluster.
I'm looking for a ready-to-use solution. If none is available, I can write my own and in this case I'm looking for a design which is simple and reliable.
My current design is to have a worker role that uses Azure's REST API to fetch logs every minute or so and push the log entries to my ELK cluster. It sounds like that will cost about US$20/mo, and I'll have to design some bookkeeping for the periods during which my worker role is interrupted.
With so many input options, my hope was that logstash had a plugin for this task.
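In case it helps to picture the fallback, here is a minimal sketch of the polling worker described above in Node.js. The fetchOperationLogs(since) helper is a stubbed placeholder for the actual Azure REST API call, the eventTimestamp field name is an assumption, the persisted cursor file is the "bookkeeping" for interrupted periods, and Elasticsearch is written to with the official @elastic/elasticsearch client (v8-style API):

    const fs = require('fs');
    const { Client } = require('@elastic/elasticsearch');

    const es = new Client({ node: 'http://localhost:9200' });
    const CURSOR_FILE = './last-seen.json'; // survives worker restarts

    // Placeholder for the Azure REST API call: return operation-log entries newer than `since`.
    async function fetchOperationLogs(since) {
      // Wire this up to whatever management REST endpoint you use; stubbed out in this sketch.
      return [];
    }

    async function pollOnce() {
      // Resume from the last event we indexed, or start one hour back on the first run.
      const since = fs.existsSync(CURSOR_FILE)
        ? JSON.parse(fs.readFileSync(CURSOR_FILE, 'utf8')).since
        : new Date(Date.now() - 3600 * 1000).toISOString();

      const entries = await fetchOperationLogs(since);

      for (const entry of entries) {
        await es.index({ index: 'azure-operation-logs', document: entry });
      }

      if (entries.length > 0) {
        const newest = entries[entries.length - 1].eventTimestamp; // assumed field name
        fs.writeFileSync(CURSOR_FILE, JSON.stringify({ since: newest }));
      }
    }

    setInterval(() => pollOnce().catch(console.error), 60 * 1000); // roughly every minute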

How to get the number of hits on a server

I want to create a tool with which we can administer the server. There are two questions within this question:
To monitor the access/hit rate of a server, that is, to calculate how many times the server has been accessed over a particular time period, and then maybe generate some kind of graph to show the load at particular times on particular days.
However, I don't have any idea how I can gather this information.
A pretty vague idea is to either:
use a watch over the access log (in the case of Apache) and then count the number of times a notification occurs, noting down the time simultaneously, or
parse the access.log file each time and then generate the output (but the access.log file can be very big, so I'm not sure about this idea).
I am familiar with Apache, hence the above ideas are based on Apache's access log; I have no idea about others like nginx, etc.
Hence I would like to know whether I can use the above procedure, or whether there is another way.
To know when the server is reaching its limit. The idea is to use top and then show the live CPU and RAM usage via C++.
To monitor a web server the easiest way is probably to use some existing tool like webalizer.
http://www.webalizer.org/
To monitor other things like CPU and memory usage I would suggest snmpd together with some other tool like mrtg. http://oss.oetiker.ch/mrtg/
If you think that Webalizer's hourly statistics don't sample data often enough, and MRTG's 5-minute sample interval would suit you better, it is also possible to provide more data via snmpd by writing an snmpd extension. Such an extension could parse the Apache log file with a rather small amount of code and give you all the graphing functionality for free from MRTG or some other tool that processes SNMP data.
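To give a feel for how small the log-parsing part is, here is a Node.js sketch that counts hits per hour from an Apache common/combined-format access log (the log path is an assumption; the same counting logic could live inside an snmpd extension or a standalone script):

    const fs = require('fs');
    const readline = require('readline');

    // Matches the timestamp field of the Apache common/combined log format, e.g.
    // 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
    const TIMESTAMP = /\[(\d{2}\/\w{3}\/\d{4}):(\d{2}):\d{2}:\d{2} [+-]\d{4}\]/;

    async function hitsPerHour(path) {
      const hits = {};
      const rl = readline.createInterface({ input: fs.createReadStream(path) });
      for await (const line of rl) {
        const m = line.match(TIMESTAMP);
        if (!m) continue;
        const hour = `${m[1]} ${m[2]}:00`; // e.g. "10/Oct/2000 13:00"
        hits[hour] = (hits[hour] || 0) + 1;
      }
      return hits;
    }

    hitsPerHour('/var/log/apache2/access.log') // assumed path
      .then((hits) => console.table(hits))
      .catch(console.error);

Feeding the per-hour (or per-minute) counts into MRTG, Graphite or a simple database then gives you the load graphs the question asks about.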

Determining cause of CPU spike in azure

I am relatively new to Azure. I have a website that has been running for a couple of months with not too much traffic: when users are on the system, the various dashboard monitors go up, and then they flat-line the rest of the time. This week, the CPU time went way up when there were no requests and no data going in or out of the site. Is there a way to determine the cause of this CPU activity when the site is not active? It doesn't make sense to me that I should have CPU activity assigned to my site when there is no site activity.
If your website has significant processing at application start, it is possible your VM got rebooted or your app pool recycled and your onstart handler got executed again (which would cause CPU to spike without any request).
You can analyze this by adding application logs to your Application_Start event (but after initializing trace). There is another comment detailing how to enable logging, but you can also consult this link.
You need to collect data to understand what's going on. So the first things I would say are:
1. Go to the Azure management portal -> your website (assuming you are using Azure Websites) -> dashboard -> operation logs. Try to see whether there is any suspicious activity going on.
2. Download the logs for your site using any FTP client and analyze what's happening. If there is not much data, I would suggest adding more logging in your application to see what is happening or which module is spinning.
A great way to detect CPU spikes, and even to pinpoint slow-running areas of your application, is to use a profiler like New Relic. It's a free add-on for Azure that collects data and provides you with a dashboard. You might find it useful for determining the exact cause of the CPU spike.
We regularly use it to monitor the performance of our applications. I would recommend it.

Resources