I have a large set of log files that I want to characterize, or possibly feed into some kind of decision tree or other analytics, but I don't know exactly what. What kind of analysis have you done with log files, especially a lot of log files?
For example, so far I am collecting how many requests are made to a particular page for a given log file.
Servlet = 60 requests
Servlet2 = 70 requests, etc.
I guess right there I could filter down to only the most popular requests. I might also do something like 60 requests over a 2-hour period, i.e. 60 / 120 minutes = 0.5 requests per minute.
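For that kind of per-page counting, here is a minimal Python sketch, assuming common/combined-format access logs and a made-up file name (the request path is the seventh whitespace-separated field; adjust for your format):

# count_requests.py - tally requests per path and a rough per-minute rate
from collections import Counter

counts = Counter()
with open("access.log") as f:              # hypothetical file name
    for line in f:
        parts = line.split()
        if len(parts) > 6:
            counts[parts[6]] += 1          # request path, e.g. /Servlet

window_minutes = 120                       # e.g. a 2-hour log window
for path, n in counts.most_common(10):     # only the most popular requests
    print(f"{path}: {n} requests, {n / window_minutes:.2f}/min")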
Deciding what analysis to do depends on what decisions you're trying to make based on that analysis. For example, I currently monitor logs for exceptions reported by our application (all exceptions in the client application are logged with the server) to decide which high-priority client bugs to investigate. I also use log-searching software to monitor for any exceptions reported by our server software, which may need more immediate investigation. On top of the logs everything generates anyway, I use monitoring software to track usage of our web server and database server, recording usage stats etc. in a database. The final aim of this is to predict future usage levels and purchase more hardware as appropriate to keep up with demand.
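For the exception-monitoring part, a rough sketch of the same idea in Python, assuming plain-text logs under a made-up logs/ directory and that exception class names appear literally in the lines (the regex is a guess and will need tuning):

# exception_summary.py - count exception types across a set of log files
import glob, re
from collections import Counter

pattern = re.compile(r"\b([A-Za-z_.]+(?:Exception|Error))\b")
counts = Counter()
for path in glob.glob("logs/*.log"):       # hypothetical location
    with open(path, errors="replace") as f:
        for line in f:
            for name in pattern.findall(line):
                counts[name] += 1

for name, n in counts.most_common(20):     # most frequent first
    print(f"{n:6d}  {name}")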
Two (free) tools I've been using are:
Hyperic for monitoring. It's pretty easy to set up and might be able to start logging a lot of data you may be interested in, e.g. requests per second on a web server.
Splunk for searching log files. It's very easy to get set up and work with, and it gives you excellent searching capabilities over your log files. If you're working with log files right now and haven't tried out Splunk, I definitely recommend it. One word of warning: I noticed a couple of moments of 100% CPU whilst using it on our main production server, so I recently stopped running it on that machine.
Not sure what your aim is with this analysis; mine has been very much about looking for any errors I should know about, and planning for future capacity needs. If you're interested in the latter I'd also recommend The Art of Capacity Planning.
Related
I just want to try creating a web application to monitor server status. I need some design guidelines.
Should I use a scripting language like Python or Ruby to get the stats? Is polling the only way to do it? If so, how frequently should we poll?
If you don't care about data retention, writing a simple web app in Ruby or Python that polls from the browser would probably be fine. You could alternatively use WebSockets and push data from some sort of CLI-based monitoring agent running in the background on your server.
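As a rough sketch of the browser-polling approach (standard library only; the port and the choice of load average as the stat are arbitrary):

# status_server.py - minimal status endpoint for a polling dashboard
import json, os
from http.server import BaseHTTPRequestHandler, HTTPServer

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # os.getloadavg() is Linux/macOS only; swap in psutil for Windows.
        load1, load5, load15 = os.getloadavg()
        body = json.dumps({"load1": load1, "load5": load5, "load15": load15}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), StatusHandler).serve_forever()

The page would then poll this endpoint every 5-10 seconds and chart the numbers client-side; for most status dashboards that interval is plenty.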
If you don't care about data fidelity, then you might be able to use something simple like pingdom.
If you do care about data retention and you need lots of custom monitoring, then it's a much harder problem. There are a number of open source projects and paid applications that will solve this problem in various ways. As mentioned in the comment on your post, Ganglia could work. You might also look into Nagios or Munin. If you need app-level stats, you could check out StatsD/Graphite or InfluxDB/Grafana.
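For app-level stats, the StatsD counter protocol is just a small UDP datagram, so instrumenting a request path takes only a few lines. A sketch, assuming a StatsD daemon on localhost:8125 and a made-up metric name:

# statsd_counter.py - fire-and-forget counter increment over UDP
import socket

def incr(metric: str, value: int = 1) -> None:
    payload = f"{metric}:{value}|c".encode()           # "|c" marks a counter
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, ("127.0.0.1", 8125))          # StatsD's default port
    sock.close()

incr("web.requests")    # call this from your request handler

Because it's UDP, a slow or absent StatsD daemon never blocks the request.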
If you want server monitoring but don't want to manage additional infrastructure, there are a lot of solutions in the paid space, including Librato, New Relic, and Instrumental.
Note: I am an owner of Instrumental, so I'm biased toward that, but I think your question needs more details to narrow down any recommendations on infrastructure monitoring.
I am developing a website (basically a public-facing site).
How can I simulate multiple users surfing my site and doing various activities, so that I can understand how my site will behave under real-world load?
I am using Apache server and PHP.
As mentioned in previous posts, you will need a load test tool. The good news is that there are many tools and services in this field: open source load test tools like JMeter and Gatling, and commercial ones like LoadRunner and NeoLoad. The bad news is that you have to answer some questions and make some decisions.
One key decision you need to make is whether to test your application in the lab or in the cloud.
cloud-based testing: BlazeMeter, SOASTA, Neustar ...
in-lab testing: JMeter, Gatling, NeoLoad, LoadRunner, WebPerformer ...
In addition, you need to answer the following questions:
how many virtual clients you want to emulate to stress the server
how much budget do you have
how complex is the web application
how much skill do the tester(s) have
If you have a high budget, a complex web application, and testers with good skills (like a developer), you can consider LoadRunner or NeoLoad.
If you have a low budget but your tester(s) have good skills, you can consider JMeter and Gatling.
If you need to emulate lots of virtual clients (say 10000) to stress your complex web site and your tester(s) don't have the skills of a developer/programmer, you may want to consider NetGend. There is a blog where you can find out how complex performance testing can be (like filling HTML forms, extracting values from JSON messages, etc.) and how easy it is on the NetGend platform. By the way, you don't need a high budget for NetGend.
Good luck with your load/performance testing!
What you want is a load testing tool; there are several, but I'd check out NeoLoad. You could also use Selenium and the various ways to run Selenium tests automatically.
As this is your first time engaging in this task you would be well advised to find someone who has been there, done that and developed the battle scars from this activity. It is not a trivial effort to performance test a piece of software. If you listen to the traditional software vendors they will tell you that "any business analyst can use this tool and be effective" as if the tool is 85-95% of the skills you will need to be able to successfully performance test an application or a site. This is marketing foo to remove barriers to a sale.
In actuality the tool you select is anywhere from 5-15% of the total skill set you will need to be successful. Also, if the financial risk of failure is sufficiently high to warrant a performance test then it almost matters not which tool you pick, for the cost of the tool and the expertise will be dwarfed by your financial risk of not scaling.
If you don't have time to develop the skills, or enough lead time to get a solid performance tester, then you may want to consider some of the managed services offerings in the market, such as SOASTA, which can provide the expertise and the tool bundled within the deal. Here are some things you will want to look at in advance of any test (common issues):
Load Balancer misconfiguration resulting in distorted load to one node
Not appropriately managing your cache age for your static resources (.jpg, .css, ...) resulting in higher than expected load
All of your lookup queries to the database should be index optimized. Use a database profiler to check this
Holding onto resources too long. If your 95th-percentile page-to-page request delay is five minutes, then don't set your HTTP session timeout at 30 or 90 minutes; that holds onto resources far too long for dead sessions. I use a rule of thumb of the 95th-percentile value times 1.5 (see the sketch after this list).
If this is a shopping site, then don't hand out a default cart to everyone who shows up. Make sure they are on the revenue path before you hand them a cart, such as looking at the cart or placing something in it. Otherwise you have just built a 1:1 relationship with every visitor across just about every piece of your architecture, from web server to app server to the database server where the cart is created and managed.
Also on the cart front, implement a 100x100 rule. If someone has 100 items in the cart, pick up the phone and call them to personalize the sale. If you have persistent carts which never expire, then consider implementing a 100-day rule for evicting cart items of that age, or killing off any cart which hasn't been touched in that period. These people are clearly not on the revenue path.
Consider your design for ecommerce. Every step between add-to-cart and checkout is an opportunity to abandon the sale. The fewer the steps, the greater the conversion rate: this is the genius behind Amazon's one-click checkout. Minimize your number of steps and you will see higher revenue flow as a result.
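On the session-timeout item above, a minimal sketch of the 95th-percentile-times-1.5 rule of thumb (Python 3.8+); the delay values are placeholders that would come from your own logs:

# session_timeout.py - derive an HTTP session timeout from observed delays
import statistics

# seconds between consecutive page hits within a session (placeholder data)
delays = [12, 40, 95, 300, 30, 75, 180, 20, 60, 240]

p95 = statistics.quantiles(delays, n=20)[18]   # 95th percentile cut point
suggested_timeout = p95 * 1.5                  # rule of thumb from above
print(f"95th percentile gap: {p95:.0f}s, suggested timeout: {suggested_timeout:.0f}s")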
I would recommend Gatling for load testing. It's Scala-based, but a recorder is provided to generate workload test cases.
=> http://gatling-tool.org/
It's important to set some dimensions for assessing a test tool to simulate user traffic:
SLA details (performance goals such as pages per second (PPS), HTTP requests per second (HPS), throughput, CPU usage, etc.)
Virtual user size to simulate (you will need this info to decide the number of slave PCs/VMs, also known as load generators, based on the virtual user range)
Maintenance cost of scripts over changes
Effort to develop and execute test scripts
Number of test scenarios
Scheduling needs (you may want to schedule the tool on a regular basis or execute tests on-demand as needed)
Budget for test tool licenses and ROI (return on investment) calculations (price, tool expertise cost, utilization of the test tool on other web applications, etc.)
Metrics provided by test tools
Monitoring requirements of network, servers and client
Integration with your current test infrastructure (if you already have HP ALM, you may be interested in LoadRunner)
If you are in a hurry and don't have time to evaluate which tool to select, you may start with JMeter.
Selenium can be used to automate regression tests, but I would like to highlight that it's not effective for performance testing due to its API. Sahi is another option for test automation.
I think that you are definitely looking for a load testing tool like Blazemeter. I recently discovered a webinar which shows how to do load testing on your application using a PaaS provider as the development and runtime environment, where you can deploy your application to run the load tests. They combine Blazemeter with a monitoring tool, New Relic in this case, to see how the simulated users show up on your website. It is really cool and very interesting, since you can see how your application performs on a specific infrastructure.
Simple.
Set up server / application monitoring - New Relic is the easiest and most powerful. Free for 14 days.
Record a typical user's activity - Use JMeter to set up a proxy on your laptop and route web requests, mobile app usage, etc. through it. It sounds difficult, but it's really easy: JMeter can act as the man-in-the-middle and capture all the requests sent by the browser/app to the server(s).
Now you "clone" the above user as many times as you need/could and blast the server. Initially you'd run the load test from your dev machine. Mine can take up to 80 concurrent users before cpu/ram runs out. Beyond this level explore BlazeMeter (free 50 concurrent users), jmeter-ec2 script (free), flood.io etc. Upload your script and blast away at your server. Ideally you should run incremental stress tests at your server. 10 users, 50 users, 100 users, 200 users etc.
Analyse, fix issues & ramp up the stress - In between each stress blast, go over your New Relic data. How are the application and server performing? What's failing? How are the alerts working?
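JMeter, BlazeMeter and friends are the right tools for real runs, but as a toy illustration of the incremental ramp mentioned above, here is a minimal sketch; the target URL and user counts are placeholders:

# ramp_test.py - toy incremental load test (use JMeter/Gatling for real work)
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://localhost/"                  # hypothetical target

def hit(_):
    start = time.time()
    try:
        urlopen(URL, timeout=10).read()
        return time.time() - start         # response time in seconds
    except Exception:
        return None                        # count as a failure

for users in (10, 50, 100, 200):           # incremental stress levels
    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(hit, range(users)))
    ok = [r for r in results if r is not None]
    avg = sum(ok) / len(ok) if ok else float("nan")
    print(f"{users} users: {len(ok)}/{users} succeeded, avg {avg:.2f}s")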
If you are also looking for UI testing, you should check out the Sahi Pro 6 automated testing tool; it can also be integrated with Jenkins.
==> http://sahipro.com/
It is really easy to record user actions with it in any browser and just play back the recorded scripts.
You can run scripts simultaneously on multiple browsers, thus simulate multiple users browsing your page:
https://sahipro.com/docs/using-sahi/playback-desktop.html#Distributed%20Runs%20-%20More%20Information
Note: there are a few similar questions already asked here, but they are from 2009. Maybe something has changed since then.
I'm responsible for a bunch of websites hosted on different servers. I do not do any log analysis right now, but I would like to change this. First question: what is the best tool to view issues with a website based on IIS logs (i.e. 404s, 500 responses, long page processing times, etc.)? Ideally with grouping/sorting options? I do not want to spend a lot of time on this; I just want to periodically check that all is well with the website.
Second question (and I know I'm most likely asking for too much): is there any way to expose the processed logs to the web, so I can review the things mentioned above without RDPing into the server?
Ideally I'm looking for a free/open source solution, but I'm ready to pay for a good software as well (but not a lot of $$).
Thank you.
You can take a look at our log monitoring solution EventSentry, which can monitor text-based logs like IIS logs. We have standard templates set up for IIS, and we can consolidate the logs in a database with web access, so that you can review the logs without using RDP.
It's a pretty flexible solution that allows you to pick the fields you are interested in, and ignore the ones you are not - and thus save space in your database.
You can also set up real-time alerts, so that you can get an email when a critical error is encountered in a log file, like a 500 error.
http://www.eventsentry.com/features/log-file-monitoring
Finally, you can also plug-in command line tools which can verify that a given web page is accessible, or get alerted when it changes: http://www.eventsentry.com/features/application-monitoring.
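As a trivial example of the kind of external command-line check that could be plugged into a monitoring tool (a generic sketch, not an EventSentry sample; the URL is a placeholder):

# check_page.py - exit 0 if the page answers with HTTP 200, non-zero otherwise
import sys
from urllib.request import urlopen

URL = "http://www.example.com/"            # hypothetical page to verify

try:
    status = urlopen(URL, timeout=10).getcode()
except Exception:                          # timeouts, DNS failures, 4xx/5xx
    status = None

sys.exit(0 if status == 200 else 1)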
I'm biased of course, but I would say that our solution is pretty affordable. Since it offers additional functionality as well, such as service monitoring (to monitor your IIS services) and event log monitoring (IIS does log critical messages to the event log), you can set up comprehensive monitoring with a single product.
I'd look into #LuckyLuke's solution (or similar) - a classic "build vs buy" decision. Based on your post, this isn't going to be your "full time" job, so IMHO it's best to leave it to those for whom it is...
I don't know what "legacy" answers you are referring to, but if you want to tinker you can use Microsoft's own log parser, and depending on how far you want to go with it, you can use it (COM dll) to write your "admin web pages" in .Net/ASP.Net and host it in each of your servers....
If you're very specific about the errors you want to be alerted about, another "hacky" way would be to provide your own custom error pages (either override the default IIS error pages, or configure your ASP.NET apps to use specific error pages).
Say I have a bunch of web servers, each serving hundreds of requests/s, and I want to see real-time stats like:
Request rate over last 5s, 60s, 5 min etc
Number of unique users seen, again per time window
Or in general for a bunch of timestamped events, I want to see real-time derived statistics - what's the best way to go about it?
I've considered having each GET request update a global counter somewhere, then sampling that at various intervals, but at the event rates I'm seeing it's hard to get a distributed counter that's fast enough.
Any ideas welcome!
Added: Servers are Linux running Apache/mod_wsgi, with a Python (Django) stack.
Added: To give a sense of the event rates I want to track stats for, they're coming in at over 10K events/s. Even incrementing a distributed counter at that rate is a challenge.
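As a sketch of the bucketed-counter idea mentioned in the question (per process only, so it sidesteps the distributed part): a ring of one-second buckets gives last-5s/60s rates cheaply, and each process could then publish its buckets for aggregation. The class and method names are made up:

# rate_window.py - per-process sliding-window request counter (1-second buckets)
import threading, time

class RateWindow:
    def __init__(self, seconds=300):
        self.size = seconds
        self.buckets = [0] * seconds       # one count per second of history
        self.stamps = [0] * seconds        # which absolute second each slot holds
        self.lock = threading.Lock()

    def hit(self, n=1):
        now = int(time.time())
        i = now % self.size
        with self.lock:
            if self.stamps[i] != now:      # slot is stale, recycle it
                self.stamps[i] = now
                self.buckets[i] = 0
            self.buckets[i] += n

    def rate(self, window):
        now = int(time.time())
        with self.lock:
            total = sum(self.buckets[t % self.size]
                        for t in range(now - window, now)
                        if self.stamps[t % self.size] == t)
        return total / window              # requests per second over the window

counter = RateWindow()
# call counter.hit() in the request handler, then sample counter.rate(5),
# counter.rate(60) etc. from wherever you render the stats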
You might like to help us try out the beta of our agent for application performance monitoring in Python web applications.
http://newrelic.com
It delves more into application performance than just the web server, but since any bottlenecks generally aren't going to be in the web server but in your application, that is going to be more useful anyway.
Disclaimer: I work for New Relic and this is the project I am working on. It is a paid product, but the beta means it is free for now with all features. Later, when that changes, if you don't want to pay for it, there is still a Lite subscription level which is free and gives you basic web metrics reporting, which still covers some of what you are after. Anyway, right now would be a great opportunity to make use of it to debug your performance while you can.
Virtually all good servers provide this kind of functionality out of the box. For example, Apache has the mod_status module and Glassfish supports JMX. Furthermore, there are many commercial packages for monitoring clusters, such as Hyperic and Zenoss.
What web or application server are you using? It is difficult to provide a solution without that information.
Look at using WebSockets; their overhead is much smaller than an HTTP request, and they are very well suited to real-time web applications. See http://nodeknockout.com/ for Node-based WebSocket examples.
http://en.wikipedia.org/wiki/WebSocket
You will need to run a daemon if you want to run it on your Apache server.
Also take a look at:
http://kaazing.com/ if you want less hassle, but are willing to fork out some cash.
On the Windows side, Performance Monitor is the tool you should investigate.
As Jared O'Connor said, you should specify what kind of web server you want to monitor.
I wonder how high-traffic websites handle traffic logging. For example, a website like myspace.com receives a lot of hits; I can imagine it would take a lot of space to log all those requests. So do they log every single request, or how do they handle this?
If you view source on a MySpace page, you get the answer:
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-6293770-1");
pageTracker._setDomainName(".myspace.com");
pageTracker._setSampleRate("1"); //sets sampling rate to 1 percent
pageTracker._trackPageview();
</script>
That script means they're using Google Analytics.
They can't just gauge traffic using IIS logs because they may sell ads to third parties, and third parties won't take your word for how much traffic you get. They want independent numbers from a separate company, and that's where Google Analytics comes in.
Just for future reference - whenever you've got a question about how a web site is doing something, try viewing the source. You'd be amazed at what you can find there in plain view.
We had a similar issue with our intranet, which is used by hundreds of people. The disk activity was huge and performance was suffering.
The short answer is asynchronous non-blocking logging.
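Not their implementation, but in Python the same idea is a few lines with the standard library's queue-based handlers: the request thread only enqueues the record, and a background thread does the slow disk write (the file name is a placeholder):

# async_logging.py - non-blocking logging via a queue and a listener thread
import logging, queue
from logging.handlers import QueueHandler, QueueListener, RotatingFileHandler

log_queue = queue.Queue(-1)                           # unbounded queue
file_handler = RotatingFileHandler("app.log", maxBytes=10_000_000, backupCount=5)
file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

listener = QueueListener(log_queue, file_handler)     # writes on its own thread
listener.start()

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(QueueHandler(log_queue))            # callers only enqueue

logger.info("request handled")                        # returns almost immediately
# call listener.stop() on shutdown to flush whatever is still queued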
Probably like Google Analytics: use JavaScript to load a page on a different server, etc.
I don't know how they track it, since I don't work there. I am pretty sure that they have enough storage to record every little thing about their users if they wanted to.
If I were them, I would use AwStats if I just wanted to know basic stuff about my users.
It is more likely that they have developed their own scripts for tracking their users. Stuff they would log:
-ip_address
-referrer
-time
-browser
-OS
and so on. Then a script to view different data about a user by day, week, or month. As brulak said, something along the lines of Analytics, but since they have access to the actual database, they can learn much more about their users.
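A sketch of that kind of rollup script, assuming the tracked fields ended up as tab-separated lines of timestamp, IP, referrer, browser and OS (the file name and layout are made up):

# rollup.py - daily counts of hits and unique IPs from a tracking log
from collections import defaultdict
from datetime import datetime

hits = defaultdict(int)
uniques = defaultdict(set)

with open("tracking.tsv") as f:            # hypothetical file
    for line in f:
        ts, ip, referrer, browser, os_name = line.rstrip("\n").split("\t")
        day = datetime.fromisoformat(ts).date()   # assumes ISO timestamps
        hits[day] += 1
        uniques[day].add(ip)

for day in sorted(hits):                   # the same idea works per week or month
    print(f"{day}: {hits[day]} hits, {len(uniques[day])} unique IPs")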
ZXTM traffic shaping and logging, speaking from experience here
I'd be extremely surprised if they didn't log every single request, yes, and operations with particularly high traffic volumes usually roll their own log-management solutions against the raw server logs, in some form or other -- sometimes as simple batch-type processes, sometimes as complete subsystems.
One company I worked for, back in the dot-com heyday, got upwards of twenty million pageviews a day; for that site (actually a set of them, running across a few dozen machines in all, as I recall), our ops team wrote a quite sophisticated, clustered solution in C that parsed, translated (into relational storage), compressed and distributed the logs daily. Log files, especially verbose ones, pile up fast, and the commercial solutions available at the time just couldn't cut it.
If by logging you mean collecting server-related information (request and response times, DB and CPU usage per request, etc.), I think they sample only 10% or 1% of the traffic. That gives the same results (providing developers with auditing information) without filling up the disks or slowing the site down.
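Sampling like that is essentially one line per request; a sketch, assuming you only want roughly 1% of requests to carry the extra instrumentation:

# sample.py - decide per request whether to record detailed metrics
import random

SAMPLE_RATE = 0.01                 # ~1% of traffic

def should_sample() -> bool:
    return random.random() < SAMPLE_RATE

# in the request handler:
# if should_sample():
#     record_timings(request)      # hypothetical detailed-logging call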