How long to retain an archive of web server traffic logs?

How long to retain an archive of web server traffic logs? - iis

We've currently got four web servers in a farm generating IIS web logs about 100Mb per day. These can be compressed pretty effieciently down to somewhere around 5% of their size.
We are planning to use waRmZip to move them off the servers and onto a SAN. After a week or so we can be confident we don't have any technical issues to investigate so the only other thing would be using them for trend analysis as a compliment to Google Analytics.
What retention periods do people recommend? Are there any legal requirements to keep this data?

Legal requirements will depend on your country, how much you're logging, and quite possibly the nature of your business. Talk to your company's lawyers - legal advice on SO is likely to be worth what you pay for it.
If you're only storing 5MB per day, you should be able to store them for basically as long as you want without worrying on the technical front.

Please consider the sensitivity of your web log data as well. I have no idea whether access to your web apps would be considered sensitive if made public, but you need to realize that your web logs contain the necessary information to potentially identify individuals (esp. in conjunction with other information available elsewhere). Your privacy policies should reflect how long you retain these logs and what purposes to which they will be put. Google, I think, recently decided to anonymize their logs after 9 months to help protect user privacy. Granted, their situation is a little different since they collect so much information, but you need to consider your customer's needs as well as your own when determining how long and in what form to keep your logs.

I tend to keep mine forever. That's mainly for trend analysis because Google misses some visitors (non-JavaScript ones).

Related

Does GAE offer default quantitative abuse protection?

If I were to say, upload the sample application written in Python, would Google protect me from malicious bots trying to eat up my resources? DoS attacks?
Exactly how much security can I expect from Google?
background:
I've read this article and it looks like you have the option to manually request certain blocks of IP addresses to be blocked. I am not very knowledgeable when it comes to security, but I would have imagined that Google would automatically blacklist suspicious IPs. But then I realized I really didn't know what kind of protection Google did provide, if any, so I thought it might be best to ask.

They will not protect you. You have to manually block the IP's and even that requires a redeploy of code (there's no UI for it).
I'm speaking about this from the experience of a surprise $1000 / week bill on a normally $5 / day app. I had upped the limit to do a major import of data consuming a ton of resources and then not set it back down again. Big mistake. They did give me system credits for less than a third of it, not sure if that was due to this being the day after the billing change (pre-billing change it wouldn't have cost more than the $5 / day) or if it's general policy after a DoS attack.
Even if you have the bill set to be low, they will just stop serving your resources as soon as your bill is eaten up and no warning email will be sent requiring you to use a third-party monitoring service or watch your site 24/7, making the DoSers job much easier.
Bottom line: tread carefully.

Are services like AWS secure enough for an organization that is highly responsible for it's clients privacy?

Okay, so we have to store our clients` private medical records online and also the web site will have a lot of requests, so we have to use some scaling solutions.
We can have our own share of a datacenter and run something like Zend Server Cluster Manager on it, but services like Amazon EC2 look a lot easier to manage, and they are incredibly cheaper too. We just don't know if they are secure enough!
Are they?
Any better solutions?
More info: I know that there is a reference server and it's highly secured and without it, even the decrypted data on the cloud server would be useless. It would be a bunch of meaningless numbers that aren't even linked to each other.
Making the question more clear: Are there any secure storage and process service providers that guarantee there won't be leaks from their side?

First off, you should contact AWS and explain what you're trying to build and the kind of data you deal with. As far as I remember, they have regulations in place to accommodate most if not all the privacy concerns.
E.g., in Germany such thing is a called a "Auftragsdatenvereinbarung". I have no idea how this relates and translates to other countries. AWS offers this.
But no matter if you go with AWS or another cloud computing service, the issue stays the same. And therefor, whatever is possible is probably best answered by a lawyer and based on the hopefully well educated (and expensive) recommendation, I'd go cloud shopping, or maybe not. If you're in the EU, there are a ton of regulations especially in regards to medical records -- some countries add more to it.
From what I remember it's basically required to have end to end encryption when you deal with these things.
Last but not least security also depends on the setup and the application, etc..
For complete and full security, I'd recommend a system that is not connected to the Internet. All others can fail.

You should never outsource highly sensitive data. Your company and only your company should have access to it - in both software and hardware terms. Even if your hoster is generally trusted someone there might just steal hardware.
Depending on the size of your company you should have your custom servers - preferable even unaccessible for the technicans in your datacenter (supposing you don't own the datacenter ;).
So the more important the data is, the less foreign people should have access to it in any means. In the best case you can name all people that have access to them in any way.
(Update: This might not apply to anonymous data, but as you're speaking of customers I don't think that applies here?)
(On a third thought: There're are probably laws to take into consideration of how you have to handle that kind of information ;)

Difference between Ad company statistics, Google Analytics and Awstats on adult sites

I have this problem. I have web page with adult content and for several past months i had PPC advertisement on it. And I've noticed a big difference between Ad company statistics of my page, Google Analytics data and Awstats data on my server.
For example, Ad company tells me, that i have 10K pageviews per day, Google Analytics tells me, that i have 15K pageviews and on Awstats it's around 13K pageviews. Which system should I trust? Should i write my own (and reinvent a wheel again)? If so, how? :)
The joke is, that i have another web page, with "normal" content (MMORPG fan site) and those numbers are +- equal in all three systems (ad company, GA, Awstats). Do you think it's because it's not adult oriented page?
And final question, that is totally offtopic, do you know about Ad company that pays per impression and don't mind adult sites?
Thanks for the answers!

First, you should make sure not to mix up »hits«, »files«, »visits« and »unique visits«. They all have a different meaning and are sometimes called differently. I recommend you to look up some definitions if you are confused about the terms.
awstats has probably the most correct statistics, because it has access to the access.log from the web server. Unfortunately, a cached site (maybe cached by the browser, a proxy from an ISP or your own caching server) might not produce a hit on the web server. Especially if your site is served with good caching hints which don't enforce a revalidation and you are running your own web cache (e.g. Squid) in front of your site, the number will be considerable lower, because it only measures the work of the web server.
On the other hand, Google Analytics is only able to count requests from users which haven't blocked Google Analytics and have JavaScript enabled (but they will count pages served by a web cache). So, this count can be influenced by the user, but isn't affected by web caches.
The ad-company is probably simply counting the number of requests which they get from your site (probably based on their access.log). So, to get counted there, the add must not be cached and must not be blocked by the user.
So, as you can see, it's not that easy to get a single correct value. But as long as you use the measured values in comparison to those from the previous months, you should get at least a (nearly) correct rate of growth.
And your porn site probably serves a high amount of static content (e.g. images from the disk) and most of the web servers are really good at serving caching hints automatically for static files. Your MMORPG on the other hand, might mostly consist of some dynamic scripts (PHP?) which don't send any caching hints at all and web servers aren't able to determine those caching headers for dynamic content automatically. That's at least my explanation, without knowing your application and server configuration :)

SharePoint Governance

I have been search about SharePoint Governance for past few days, the more I search, more confused I am getting about this Topic.
Could anyone just explain in brief? What you know about it or are you using/implementing it?

This is one of those topics that tend to mean something a bit different when you get to the details for different companies, but basically here's my experience with it.
When you talk about governance its about who is responsible for different elements of it, from defining and managing the taxonomy, to the content for different areas. What the process of publication is, for modification is. Who administers different sections, how documents are to be handled/managed/archived/retained, what the auditing policy is. How escalations will be managed, who needs to be notified, who the stakeholders are, etc.
Basically in other words it's about looking at SharePoint as being not just a content storage system but needing to be a coherent enterprise tool with all of the key stakeholders sitting up and taking notice, and recognizing that some things need clear process to avoid either redoing everything in 18 months or being unable to find things.
Some of the most important things that I find ends up getting discussed is security (internal vs. external access, employees vs. contractors vs. external people, logging of who accessed files, etc.), roll back & geographic distribution (for performance), and revocation of rights (when someone is fired, how is their access revocation managed quickly and effectively).
Sorry this is a bit of a shotgun answer, but that's the type of stuff you're looking at. There are actually consultants who have specialized in these areas. It's usually (from what I experience) about 20% technical, 40% change management and 40% process / business.
I hope that helps.

Logging requests on high traffic websites

I wonder how high traffic websites handle traffic logging, for example a website like myspace.com receives a lot of hits, I can imagine it would take a lot of space to log all those requests, so, do they log every single request or how do they handle this?

If you view source on a MySpace page, you get the answer:
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-6293770-1");
pageTracker._setDomainName(".myspace.com");
pageTracker._setSampleRate("1"); //sets sampling rate to 1 percent
pageTracker._trackPageview();
</script>
That script means they're using Google Analytics.
They can't just gauge traffic using IIS logs because they may sell ads to third parties, and third parties won't take your word for how much traffic you get. They want independent numbers from a separate company, and that's where Google Analytics comes in.
Just for future reference - whenever you've got a question about how a web site is doing something, try viewing the source. You'd be amazed at what you can find there in plain view.

We had a similar issue with out Intranet which is used by hundreds of people. The disk activity was huge and performance was being hurt.
The short answer is Asynchronous non-blocking logging.

probably like google analytics.
Use Javascript to load a page on a difference server, etc.

Don't how they track it since I don't work there. I am pretty sure that they have enough storage to record every little thing about their user if they wanted.
If I were them, I would use AwStats if I just wanted to know basic stuff about my users.
It is more likely that they have developed their own scripts for tracking their users. Stuff they would log
-ip_address
-referrer
-time
-browser
-OS
and so on. Then a script to see different data about the user varying by day, weeks, or months. As brulak said, something along the line of Analytics, but since they have access to actual database, they can learn much more about their users.

ZXTM traffic shaping and logging, speaking from experience here

I'd be extremely surprised if they didn't log every single request, yes, and operations with particularly high traffic volumes usually roll their own log-management solutions against the raw server logs, in some form or other -- sometimes as simple batch-type processes, sometimes as complete subsystems.
One company I worked for, back in the dot-com heyday, got upwards of twenty million pageviews a day; for that site (actually a set of them, running across a few dozen machines in all, as I recall), our ops team wrote a quite sophisticated, clustered solution in C that parsed, translated (into relational storage), compressed and distributed the logs daily. Log files, especially verbose ones, pile up fast, and the commercial solutions available at the time just couldn't cut it.

If by logging you mean for collecting server related information (request and response times, db and cpu usage per request etc) I think they sample only the 10% or 1% of the traffic. That gives the same results (provide developers with auditing information) without filling in the disks or slowing the site down.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string