How to detect inbound HTTP requests sent anonymously via Tor?

I'm developing a website and am sensitive to people screen scraping my data. I'm not worried about scraping one or two pages -- I'm more concerned about someone scraping thousands of pages as the aggregate of that data is much more valuable than a small percentage would be.
I can imagine strategies to block users based on heavy traffic from a single IP address, but the Tor network sets up many circuits that essentially mean a single user's traffic appears to come from different IP addresses over time.
I know that it is possible to detect Tor traffic: when I installed Vidalia with its Firefox extension, google.com presented me with a captcha.
So, how can I detect such requests?
(My website is in ASP.NET MVC 2, but I think any approach used here would be language-independent.)

"I'm developing a website and am sensitive to people screen scraping my data"
Forget about it. If it's on the web and someone wants it, it will be impossible to stop them from getting it. The more restrictions you put in place, the more you'll risk ruining user experience for legitimate users, who will hopefully be the majority of your audience. It also makes code harder to maintain.
I'll post countermeasures to any ideas future answers propose.

You can check their IP address against a list of Tor exit nodes. I know for a fact this won't even slow down someone who is interested in scraping your site. Tor is too slow; most scrapers won't even consider it. There are tens of thousands of open proxy servers that can easily be scanned for, or a list can be purchased. Proxy servers are nice because you can thread them or rotate them if your request cap gets hit.
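As a minimal sketch of that exit-node check (assuming the Tor Project's bulk exit list endpoint; the function names are illustrative), you could periodically download the published list and test each incoming address against it:

import urllib.request

# Tor Project's published list of exit node IPs (assumed endpoint);
# fetch it on a schedule and cache the result rather than per request.
EXIT_LIST_URL = "https://check.torproject.org/torbulkexitlist"

def load_tor_exit_ips():
    with urllib.request.urlopen(EXIT_LIST_URL, timeout=10) as resp:
        return {line.strip() for line in resp.read().decode().splitlines() if line.strip()}

tor_exits = load_tor_exit_ips()  # refresh periodically, e.g. every 30 minutes

def is_tor_exit(remote_ip):
    return remote_ip in tor_exits

Remember the caveat above, though: this only catches Tor exits, not the far larger pool of open proxies.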
Google has been abused by Tor users, and most of the exit nodes are on Google's blacklist; that's why you are getting a captcha.
Let me be perfectly clear: THERE IS NOTHING YOU CAN DO TO PREVENT SOMEONE FROM SCRAPING YOUR SITE.

By design of the Tor network, it is not possible for the receiver to find out whether the requester is the original source or just a relay.
The behaviour you saw with Google was probably caused by a different security measure. Google detects when a logged-in user changes their IP and presents a captcha, just in case, to prevent harmful interception and to allow the session to continue if the authenticated user really did change IP (by re-logging on to their ISP, etc.).

I know this is old, but I got here from a Google search, so I figured I'd get to the root concerns in the question here. I develop web applications, but I also do a ton of abusing and exploiting other people's. I'm probably the guy you're trying to keep out.
Detecting Tor traffic really isn't the route you want to go here. You can detect a good number of open proxy servers by parsing request headers, but then you've got Tor, high-anonymity proxies, SOCKS proxies, cheap VPNs marketed directly to spammers, botnets, and countless other ways to break rate limits.
If your main concern is a DDoS effect, don't worry about it. Real DDoS attacks take either muscle or some vulnerability that puts strain on your server. No matter what type of site you have, you're going to be flooded with hits from spiders as well as bad people scanning for exploits. Just a fact of life. In fact, this kind of logic on the server almost never scales well and can be the single point of failure that leaves you open to a real DDoS attack.
This can also be a single point of failure for your end users (including friendly bots). If a legitimate user or customer gets blocked you've got a customer service nightmare and if the wrong crawler gets blocked you're saying goodbye to your search traffic.
If you really don't want anybody grabbing your data, there are some things you can do. If it's blog content or something, I generally say either don't worry about it, or offer summary-only RSS feeds if you need feeds at all. The danger with scraped blog content is that it's actually pretty easy to take an exact copy of an article, spam links to it, and rank it while knocking the original out of the search results. At the same time, because it's so easy, people aren't going to put effort into targeting specific sites when they can scrape RSS feeds in bulk.
If your site is more of a service with dynamic content, that's a whole other story. I actually scrape a lot of sites like this to "steal" huge amounts of structured proprietary data, but there are options to make it harder. You can limit requests per IP, but that's easy to get around with proxies. For some real protection, relatively simple obfuscation goes a long way. If you try to do something like scrape Google results or download videos from YouTube, you'll find there's a lot to reverse engineer. I do both of these, but 99% of people who try fail because they lack the knowledge to do it. They can scrape proxies to get around IP limits, but they're not breaking any encryption.
As an example, as far as I remember, a Google results page comes with obfuscated JavaScript that gets injected into the DOM on page load, and then some kind of tokens are set, so you have to parse them out. Then there's an AJAX request with those tokens that returns obfuscated JS or JSON that's decoded to build the results, and so on and so on. This isn't hard to do on your end as the developer, but the vast majority of potential thieves can't handle it. Most of the ones that can won't put in the effort. I do this to wrap really valuable services from Google, but for most other services I just move on to lower-hanging fruit at other providers.
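A hedged sketch of that token idea, assuming an HMAC-signed, short-lived token that the page embeds and the AJAX endpoint verifies (the secret, field layout, and expiry are all illustrative, not from the answer):

import hashlib
import hmac
import time

SECRET = b"replace-with-a-server-side-secret"  # illustrative placeholder

def issue_token():
    # Embed this in the rendered page; the client must echo it back.
    ts = str(int(time.time()))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return ts + ":" + sig

def token_valid(token, max_age=300):
    # Reject missing, forged, or expired tokens on the AJAX endpoint.
    try:
        ts, sig = token.split(":")
    except ValueError:
        return False
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and time.time() - int(ts) < max_age

A determined scraper can still drive a real browser through this, but as the answer says, most won't put in the effort.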
Hope this is useful for anyone coming across it.

I think the focus on how it is 'impossible' to prevent a determined and technically savvy user from scraping a website is given too much significance. @Drew Noakes states that the website contains information that, taken in aggregate, has some 'value'. If a website has aggregate data that is readily accessible to unconstrained anonymous users, then yes, preventing scraping may be near 'impossible'.
I would suggest the problem to be solved is not how to prevent users from scraping the aggregate data, but rather what approaches could be used to remove the aggregate data from public access, thereby eliminating the target of the scrapers without the need to do the 'impossible': prevent scraping.
The aggregate data should be treated like proprietary company information. Proprietary company information is in general not publicly available to anonymous users in aggregate or raw form. I would argue that the solution to prevent the taking of valuable data is to restrict and constrain access to the data, not to prevent scraping of it once it is presented to the user.
1] User accounts/access: no one should ever have access to all the data within a given time period (data/domain specific). Users should be able to access data that is relevant to them, but clearly, from the question, no user would have a legitimate purpose to query all the aggregate data. Without knowing the specifics of the site, I suspect that a legitimate user may need only some small subset of the data within some time period. Requests that significantly exceed typical user needs should be blocked or throttled, so as to make scraping prohibitively time consuming and the scraped data potentially stale.
2] Operations teams often monitor metrics to ensure that large, distributed, complex systems are healthy. Unfortunately, it becomes very difficult to identify the causes of sporadic and intermittent problems, and often it is even difficult to identify that there is a problem as opposed to normal operational fluctuation. Operations teams often deal with statistically analysed historical data taken from numerous metrics, comparing them to current values to help identify significant deviations in system health, be it uptime, load, CPU utilization, etc.
Similarly, requests from users for data in amounts significantly greater than the norm could help identify individuals likely to be scraping data; such an approach can be automated and even extended to look across multiple accounts for patterns that indicate scraping. User 1 scrapes 10%, user 2 scrapes the next 10%, user 3 scrapes the next 10%, etc. Patterns like that (and others) could provide strong indicators of malicious use of the system by a single individual or a group utilizing multiple accounts (see the sketch after this list).
3] Do not make the raw aggregate data directly accessible to end users. Specifics matter here, but simply put, the data should reside on back-end servers and be retrieved through some domain-specific API. Again, I'm assuming that you are not just serving up raw data, but rather responding to user requests for subsets of the data. For example, if the data you have is detailed population demographics for a particular region, a legitimate end user would be interested in only a subset of that data: say, addresses of households with teenagers that reside with both parents in multi-unit housing, or data on a specific city or county. Such a request would require processing the aggregate data to produce a resultant data set of interest to the end user. It would be prohibitively difficult to scrape every resultant data set retrievable from the numerous potential permutations of the input query and so reconstruct the aggregate data in its entirety. A scraper would also be constrained by the website's security, taking into account the number of requests per unit time, the total data size of the resultant data set, and other potential markers. A well-developed API incorporating domain-specific knowledge is critical in ensuring that the API is comprehensive enough to serve its purpose but not so overly general as to return large raw data dumps.
The incorporation of user accounts into the site, the establishment of usage baselines for users, the identification and throttling of users (or other mitigation approaches) who deviate significantly from typical usage patterns, and the creation of an interface for requesting processed/digested result sets (vs. raw aggregate data) would create significant complexity for malicious individuals intent on stealing your data. It may be impossible to prevent scraping of website data, but the 'impossibility' is predicated on the aggregate data being readily accessible to the scraper. You can't scrape what you can't see. So unless your aggregate data is raw, unprocessed text (for example, library e-books), end users should not have access to it. Even in the library e-book example, significant deviation from acceptable usage patterns, such as requesting a large number of books in their entirety, should be blocked or throttled.
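Picking up the baseline idea from point 2 above, here is an illustrative sketch of flagging accounts whose request volume deviates sharply from the norm (the threshold, record format, and function name are all assumptions):

import statistics
from collections import Counter

def flag_heavy_users(request_log, z_threshold=3.0):
    # request_log: iterable of account ids, one entry per request in the period.
    counts = Counter(request_log)
    if len(counts) < 2:
        return []
    mean = statistics.mean(counts.values())
    stdev = statistics.pstdev(counts.values()) or 1.0
    # Flag accounts whose volume sits z_threshold standard deviations above the mean.
    return [acct for acct, n in counts.items() if (n - mean) / stdev > z_threshold]

The same aggregation could be extended across groups of accounts to catch the "each user scrapes 10%" pattern described above.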

You can detect Tor users using TorDNSEL - https://www.torproject.org/projects/tordnsel.html.en.
You can just use this command-line/library - https://github.com/assafmo/IsTorExit.
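For illustration, a minimal sketch of the ip-port DNS query that TorDNSEL exposes (zone name and answer convention per the documentation linked above; IPv4 only, and the function name is mine):

import socket

def is_tor_exit_dnsel(client_ip, server_ip, server_port):
    # Reverse the client IP, append the destination port and the reversed
    # server IP, and resolve against the TorDNSEL zone. An answer of
    # 127.0.0.2 means the client is a Tor exit that can reach this server.
    rev_client = ".".join(reversed(client_ip.split(".")))
    rev_server = ".".join(reversed(server_ip.split(".")))
    name = "%s.%d.%s.ip-port.exitlist.torproject.org" % (rev_client, server_port, rev_server)
    try:
        return socket.gethostbyname(name) == "127.0.0.2"
    except socket.gaierror:
        return False  # NXDOMAIN: not a known exit (or the lookup failed)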

Related

"Sandbox" Google Analytics for security

By including Google Analytics in a website (specifically the JavaScript version), isn't it true that you are giving Google complete access to all your cookies and site information? (I.e., it could be a security hole.)
Can this be mitigated by putting Google in a sandboxed iframe? Or maybe by only passing Google the necessary information (i.e. browser type, screen resolution, etc.)?
How can someone get the most out of Google Analytics without leaving the entire site open?
Or perhaps passing the data through my own server and then uploading it to Google?
You can create a scriptless implementation via the Measurement Protocol (for Universal Analytics-enabled properties). This not only avoids any security issues with the script (although I'd rather trust Google on that), it also means you have more control over what data is submitted to the Google server.
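A minimal sketch of such a server-side hit, assuming the documented Measurement Protocol endpoint and a placeholder property ID (the helper name and client-id handling are illustrative):

import urllib.parse
import urllib.request
import uuid

def send_pageview(tracking_id, client_id, page_path):
    # Build the minimal documented payload for a pageview hit.
    payload = urllib.parse.urlencode({
        "v": "1",            # protocol version
        "tid": tracking_id,  # UA property id, e.g. "UA-XXXXX-Y" (placeholder)
        "cid": client_id,    # anonymous client id you assign and persist
        "t": "pageview",     # hit type
        "dp": page_path,     # document path
    }).encode()
    urllib.request.urlopen("https://www.google-analytics.com/collect", data=payload, timeout=5)

send_pageview("UA-XXXXX-Y", str(uuid.uuid4()), "/index.html")

Since the hit originates from your server, Google only sees the fields you choose to send.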
A script run on your site can read cookies on your site, yes. And that data can be sent back to Google, yes. That is why you shouldn't store sensitive information in cookies. You shouldn't do this even if you don't use Google Analytics; even if you don't use ANY other code except your own. Browsers and browser add-ons can also read that stuff, and you definitely cannot control that. Again: never store sensitive information in cookies.
As far as access to "site information" goes: JavaScript can be used to read the content of your pages, the URLs of pages, etc. In other words, anything you serve up on a web page. Anything that is not behind a wall (e.g. a login barrier) is surely up for grabs, but crawlers will look at that stuff anyway. Stuff behind walls can still be grabbed automatically, depending on what a bot actually has to do to get past those walls (e.g. simple registration/login barriers are pretty easy to get past).
This is also why you should never display sensitive information even in the content of your site, e.g. credit card numbers, passwords, etc. That's why virtually every site you go to that has even remotely sensitive information shows a mask (e.g. ****) instead of the actual values.
Google Analytics does not actively do these things, but you're right: there's nothing stopping them from doing it, and you've already given them the right to do it by using their script.
And you are right: the safest way to control what Google can actually see is to send server-side requests to them, and also to put all your content behind barriers that cannot be easily crawled or scraped. The strongest barrier is one that involves having to pay for access. People are ingenious about making crawlers and bots to get past all sorts of forms and "human" checks, and you're fighting a losing battle on that count, but nothing stops a bot faster than requiring someone to give you money to access your stuff. Of course, this also means you'd have to make everybody pay for access...
Anyways, if you're that paranoid about this stuff, why use GA at all? Use something you host yourself (e.g. Piwik). This won't solve for crawlers/bots, obviously, but it will solve for worries about GA grabbing more than you want it to.

How to protect a website from DoS attacks

What are the best methods for protecting a site from DoS attacks? Any idea how popular sites/services handle this issue?
What are the tools/services at the application, operating system, networking, and hosting levels?
It would be nice if someone could share real experience dealing with this.
Thanks
Are you sure you mean DoS, not injection? There's not much you can do on the web-programming end to prevent a DoS, as it's more about tying up connection ports, and blocking them happens at the network layer rather than the application layer (web programming).
As for how most companies prevent them: many use load balancing and server farms to absorb the incoming bandwidth. Also, a lot of smart routers monitor activity from IPs and IP ranges to make sure there aren't too many requests coming in (and if there are, block them before they hit the server).
The biggest intentional DoS I can think of is woot.com during a woot-off. I suggest trying Wikipedia ( http://en.wikipedia.org/wiki/Denial-of-service_attack#Prevention_and_response ) and seeing what it has to say about prevention methods.
I've never had to deal with this yet, but a common method involves writing a small piece of code to track IP addresses that are making a large number of requests in a short amount of time and denying them before processing actually happens.
Many hosting services provide this along with hosting; check with yours to see if they do.
I implemented this once in the application layer. We recorded all requests served by our server farms through a service to which each machine in the farm could send request information. We then processed these requests, aggregated by IP address, and automatically flagged any IP address exceeding a threshold of requests per time interval. Any request coming from a flagged IP got a standard captcha response; if they failed too many times, they were banned forever (dangerous if you get a DoS from behind a proxy). If they proved they were human, the statistics related to their IP were "zeroed."
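A minimal in-process sketch of that per-IP counting (window size, threshold, and names are illustrative; a real deployment would aggregate across the farm, as described above):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # sliding window length (assumed)
MAX_REQUESTS = 100    # per-IP cap within the window (assumed)

_history = defaultdict(deque)

def allow_request(ip):
    # Drop timestamps that fell out of the window, record this hit,
    # and report whether the IP is still under its cap.
    now = time.time()
    window = _history[ip]
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    return len(window) <= MAX_REQUESTS  # False -> serve a captcha instead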
Well, this is an old one, but people looking to do this might want to look at fail2ban.
http://go2linux.garron.me/linux/2011/05/fail2ban-protect-web-server-http-dos-attack-1084.html
That's more of a serverfault sort of answer, as opposed to building this into your application, but I think it's the sort of problem which is most likely better tackled that way. If the logic for what you want to block is complex, consider having your application just log enough info to base the banning policy action on, rather than trying to put the policy into effect.
Consider also that, depending on the web server you use, you might be vulnerable to things like a Slowloris attack, and there's nothing you can do about that at the web-application level.

Difference between Ad company statistics, Google Analytics and Awstats on adult sites

I have this problem. I have a web page with adult content, and for the past several months I've had PPC advertising on it. I've noticed a big difference between the ad company's statistics for my page, Google Analytics data, and Awstats data on my server.
For example, the ad company tells me that I have 10K pageviews per day, Google Analytics tells me that I have 15K pageviews, and in Awstats it's around 13K pageviews. Which system should I trust? Should I write my own (and reinvent the wheel again)? If so, how? :)
The funny thing is that I have another web page with "normal" content (an MMORPG fan site), and those numbers are roughly equal in all three systems (ad company, GA, Awstats). Do you think it's because the other one is not an adult-oriented page?
And a final question that is totally off-topic: do you know of an ad company that pays per impression and doesn't mind adult sites?
Thanks for the answers!
First, make sure not to mix up »hits«, »files«, »visits« and »unique visits«. They all have different meanings and are sometimes named differently. I recommend looking up some definitions if you are confused about the terms.
Awstats probably has the most accurate statistics, because it has access to the web server's access.log. Unfortunately, a cached page (cached by the browser, an ISP's proxy, or your own caching server) might not produce a hit on the web server. Especially if your site is served with good caching hints that don't force revalidation, and you are running your own web cache (e.g. Squid) in front of your site, the number will be considerably lower, because it only measures the work of the web server.
On the other hand, Google Analytics is only able to count requests from users who haven't blocked Google Analytics and have JavaScript enabled (but it will count pages served by a web cache). So this count can be influenced by the user, but isn't affected by web caches.
The ad company is probably simply counting the number of requests they get from your site (probably based on their own access.log). So, to get counted there, the ad must not be cached and must not be blocked by the user.
So, as you can see, it's not that easy to get a single correct value. But as long as you use the measured values in comparison to those from the previous months, you should get at least a (nearly) correct rate of growth.
And your porn site probably serves a large amount of static content (e.g. images from disk), and most web servers are really good at automatically serving caching hints for static files. Your MMORPG site, on the other hand, might mostly consist of dynamic scripts (PHP?) which don't send any caching hints at all, and web servers can't determine caching headers for dynamic content automatically. That's at least my explanation, without knowing your application and server configuration :)

Detecting login credentials abuse

I am the webmaster for a small, growing industrial association. Soon, I will have to implement a restricted, members-only section for the website.
The problem is that our organization's membership includes both big companies and amateur "clubs" (it's a relatively new industry...).
It is clear that those clubs will share the login ID they use to log onto our website. The problem is to detect whether one of their members shares the login credentials with people who are not supposed to be accessing the website (there is no objection to such a club having all of its members use the website).
I have thought about logging, with each sign-on, the IP address as well as the OS and browser used; if the OS/browser stays constant and there are no more than, say, 10 different IP addresses, the account is clearly used by very few different computers.
But if there are 50 OS/browser combinations and 150 different IPs, the credentials have obviously been disseminated far and wide, and there would then be cause for action, such as changing the password.
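A hedged sketch of that heuristic (the record format, thresholds, and function name are my own assumptions for illustration):

from collections import defaultdict

IP_LIMIT = 10   # thresholds borrowed from the example above (assumed)
UA_LIMIT = 10

def flag_shared_accounts(signons):
    # signons: iterable of (account, ip, os_browser) tuples, one per login.
    ips = defaultdict(set)
    uas = defaultdict(set)
    for account, ip, ua in signons:
        ips[account].add(ip)
        uas[account].add(ua)
    # Accounts seen from too many distinct IPs or OS/browser combinations.
    return [a for a in ips if len(ips[a]) > IP_LIMIT or len(uas[a]) > UA_LIMIT]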
Of course, it is extremely annoying to have your password unilaterally changed. So, for this problem, I thought about allowing the "clubs" to manage their own list of sub-accounts; if abuse is suspected, the user responsible could then be easily pinned down, and this "sub-member" alone would face the annoyance of a password change.
Question:
What potential problems would anyone see with such an approach?
Any particular reason why you can't force each club member to register (just straight up, not necessarily as a sub-account or similar complex structure)? Perhaps give each club some sort of code to use when its users register, so you can automatically create their accounts and affiliate them with the club; you then have direct accounting of each member without an onerous process the club has to manage itself. Then it's much easier to determine if a given account is being spread around (disparate IP accesses in given periods of time).
Clearly then you can also set a limit on the number of affiliated accounts per club, should you want to do so. This is basically what you've suggested, I suppose, but I would try to keep any onerous management tasks out of the hands of your users if at all possible. If you can manage club-affiliated signups, you should, rather than forcing someone at the club to manage them for you.
Also, while some sort of heuristic based on IP and credentials is probably fine, I would stay away from incorporating user-agent, or at least caring too much about it. Seeing a few different UAs from the same IP - depending on your expected userbase, I suppose - isn't really that unusual. I use several browsers in the course of my day due to website bugs, etc. and unless someone is using a machine as a proxy, it's not evidence of anything nefarious.

Logging requests on high traffic websites

I wonder how high-traffic websites handle traffic logging. For example, a website like myspace.com receives a lot of hits, and I can imagine it would take a lot of space to log all those requests. So do they log every single request, or how do they handle this?
If you view source on a MySpace page, you get the answer:
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-6293770-1");
pageTracker._setDomainName(".myspace.com");
pageTracker._setSampleRate("1"); //sets sampling rate to 1 percent
pageTracker._trackPageview();
</script>
That script means they're using Google Analytics.
They can't just gauge traffic using IIS logs because they may sell ads to third parties, and third parties won't take your word for how much traffic you get. They want independent numbers from a separate company, and that's where Google Analytics comes in.
Just for future reference - whenever you've got a question about how a web site is doing something, try viewing the source. You'd be amazed at what you can find there in plain view.
We had a similar issue with our Intranet, which is used by hundreds of people. The disk activity was huge and performance was being hurt.
The short answer is asynchronous, non-blocking logging.
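As an illustrative sketch using Python's standard library (the handler setup and log line are placeholders): request threads enqueue records, and a background listener thread drains the queue to disk, so no request ever blocks on I/O.

import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded queue between requests and the writer

# Requests log through the queue...
logger = logging.getLogger("access")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

# ...and a background thread drains it to the file.
listener = logging.handlers.QueueListener(log_queue, logging.FileHandler("access.log"))
listener.start()

logger.info("GET /profile 200 12ms")  # returns immediately; the write happens on the listener thread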
Probably like Google Analytics: use JavaScript to load a resource from a different server, etc.
I don't know how they track it, since I don't work there. I'm pretty sure they have enough storage to record every little thing about their users if they wanted to.
If I were them, I would use AwStats if I just wanted to know basic stuff about my users.
It is more likely that they have developed their own scripts for tracking their users. Stuff they would log:
- IP address
- referrer
- time
- browser
- OS
and so on. Then a script to view different data about the users by day, week, or month. As brulak said, something along the lines of Analytics, but since they have access to the actual database, they can learn much more about their users.
ZXTM traffic shaping and logging; speaking from experience here.
I'd be extremely surprised if they didn't log every single request, yes, and operations with particularly high traffic volumes usually roll their own log-management solutions against the raw server logs, in some form or other -- sometimes as simple batch-type processes, sometimes as complete subsystems.
One company I worked for, back in the dot-com heyday, got upwards of twenty million pageviews a day; for that site (actually a set of them, running across a few dozen machines in all, as I recall), our ops team wrote a quite sophisticated, clustered solution in C that parsed, translated (into relational storage), compressed and distributed the logs daily. Log files, especially verbose ones, pile up fast, and the commercial solutions available at the time just couldn't cut it.
If by logging you mean collecting server-related information (request and response times, DB and CPU usage per request, etc.), I think they sample only 10% or 1% of the traffic. That gives the same results (providing developers with auditing information) without filling up the disks or slowing the site down.
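A minimal sketch of that kind of sampling (the 1% rate and log format are illustrative): record a random one-in-a-hundred subset so aggregate statistics stay representative while log volume drops by roughly 100x.

import logging
import random

logging.basicConfig(filename="sampled.log", level=logging.INFO)

SAMPLE_RATE = 0.01  # log ~1% of requests (assumed rate)

def maybe_log(request_line):
    # Each request has an independent 1% chance of being recorded.
    if random.random() < SAMPLE_RATE:
        logging.info(request_line)

maybe_log("GET /home 200 8ms db=2ms cpu=1ms")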
