How to protect a website from DoS attacks - security

What are the best methods for protecting a site from DoS attacks? Any idea how popular sites/services handle this issue?
What are the tools/services at the application, operating system, networking, and hosting levels?
It would be nice if someone could share real experience of how they dealt with this.
Thanks

Are you sure you mean DoS and not injection? There's not much you can do on the web programming end to prevent a DoS, as it's more about tying up connection ports, and blocking happens at the network level rather than at the application layer (web programming).
As for how most companies prevent them: a lot of companies use load balancing and server farms to spread out the incoming bandwidth. Also, a lot of smart routers monitor activity from IPs and IP ranges to make sure there aren't too many requests coming in (and if so, perform a block before it hits the server).
The biggest intentional DoS I can think of is woot.com during a woot-off, though. I suggest reading the Wikipedia article ( http://en.wikipedia.org/wiki/Denial-of-service_attack#Prevention_and_response ) to see what it has to say about prevention methods.

I've never had to deal with this yet, but a common method involves writing a small piece of code to track IP addresses that are making a large number of requests in a short amount of time and denying them before any real processing happens.
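As a rough illustration of that idea, here's a minimal in-process sketch (Python, with made-up thresholds; a real deployment would share this state across servers, e.g. in Redis):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10    # look-back window
MAX_REQUESTS = 100     # allowed requests per window per IP (tune for your traffic)

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip: str) -> bool:
    """Return False if this IP has exceeded the request budget for the window."""
    now = time.time()
    q = _hits[ip]
    q.append(now)
    # drop timestamps that have fallen out of the window
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()
    return len(q) <= MAX_REQUESTS

# usage: call allow_request(client_ip) before doing any real work;
# if it returns False, reply with 429/403 and skip processing.
```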
Many hosting services provide this along with hosting, check with them to see if they do.

I implemented this once in the application layer. We recorded all requests served to our server farms through a service which each machine in the farm could send request information to. We then processed these requests, aggregated by IP address, and automatically flagged any IP address exceeding a threshold of a certain number of requests per time interval. Any request coming from a flagged IP got a standard captcha response; if they failed too many times, they were banned forever (dangerous if you get a DoS from behind a proxy). If they proved they were human, the statistics related to their IP were "zeroed."
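Not the original code, but a minimal sketch of that flow under assumed names and thresholds: each front end reports requests to a central tracker, which aggregates per IP, flags offenders for a captcha, and zeroes the stats once the captcha is solved.

```python
import time
from collections import defaultdict

THRESHOLD = 1000          # requests per interval before an IP is flagged (assumed value)
INTERVAL = 60.0           # seconds
MAX_CAPTCHA_FAILURES = 5  # failures before a permanent ban (risky behind shared proxies)

class AbuseTracker:
    def __init__(self):
        self.counts = defaultdict(int)      # ip -> requests in the current interval
        self.window_start = time.time()
        self.flagged = set()                # IPs that must answer a captcha
        self.captcha_failures = defaultdict(int)
        self.banned = set()

    def record(self, ip):
        """Called (e.g. asynchronously) for every request served by any machine in the farm."""
        now = time.time()
        if now - self.window_start > INTERVAL:
            self.counts.clear()
            self.window_start = now
        self.counts[ip] += 1
        if self.counts[ip] > THRESHOLD:
            self.flagged.add(ip)

    def classify(self, ip):
        """'ok', 'captcha', or 'banned' -- tells the front end what response to give."""
        if ip in self.banned:
            return "banned"
        return "captcha" if ip in self.flagged else "ok"

    def captcha_result(self, ip, passed):
        if passed:                          # proved human: zero out their statistics
            self.flagged.discard(ip)
            self.counts.pop(ip, None)
            self.captcha_failures.pop(ip, None)
        else:
            self.captcha_failures[ip] += 1
            if self.captcha_failures[ip] >= MAX_CAPTCHA_FAILURES:
                self.banned.add(ip)
```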

Well, this is an old one, but people looking to do this might want to look at fail2ban.
http://go2linux.garron.me/linux/2011/05/fail2ban-protect-web-server-http-dos-attack-1084.html
That's more of a Server Fault sort of answer, as opposed to building this into your application, but I think it's the sort of problem which is most likely better tackled that way. If the logic for what you want to block is complex, consider having your application just log enough info to base the banning decision on, rather than trying to put the policy into effect itself.
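For example, the application could emit one easily parsed line per suspicious event and leave the actual banning to fail2ban or a cron job. A rough sketch (the log path and field layout are my own choices, not anything fail2ban requires):

```python
import logging

# one line per event, in a stable format an external tool can match with a regex
security_log = logging.getLogger("security")
handler = logging.FileHandler("/var/log/myapp/security.log")  # hypothetical path
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
security_log.addHandler(handler)
security_log.setLevel(logging.INFO)

def report_offense(ip, kind, detail=""):
    # e.g. "2024-01-01 12:00:00 OFFENSE ip=1.2.3.4 kind=login_failure detail=admin"
    security_log.info("OFFENSE ip=%s kind=%s detail=%s", ip, kind, detail)
```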
Consider also that, depending on the web server you use, you might be vulnerable to things like a Slowloris attack, and there's nothing you can do about that at the web application level.


How to detect inbound HTTP requests sent anonymously via Tor?

I'm developing a website and am sensitive to people screen scraping my data. I'm not worried about scraping one or two pages -- I'm more concerned about someone scraping thousands of pages as the aggregate of that data is much more valuable than a small percentage would be.
I can imagine strategies to block users based on heavy traffic from a single IP address, but the Tor network sets up many circuits that essentially mean a single user's traffic appears to come from different IP addresses over time.
I know that it is possible to detect Tor traffic as when I installed Vidalia with its Firefox extension, google.com presented me with a captcha.
So, how can I detect such requests?
(My website's in ASP.NET MVC 2, but I think any approach used here would be language independent)
"I'm developing a website and am sensitive to people screen scraping my data"
Forget about it. If it's on the web and someone wants it, it will be impossible to stop them from getting it. The more restrictions you put in place, the more you'll risk ruining user experience for legitimate users, who will hopefully be the majority of your audience. It also makes code harder to maintain.
I'll post countermeasures to any ideas future answers propose.
You can check their IP address against a list of Tor exit nodes. I know for a fact this won't even slow down someone who is interested in scraping your site. Tor is too slow; most scrapers won't even consider it. There are tens of thousands of open proxy servers that can easily be scanned for, or a list can be purchased. Proxy servers are nice because you can thread them or rotate if your request cap gets hit.
Google has been abused by Tor users, and most of the exit nodes are on Google's blacklist; that's why you are getting a captcha.
Let me be perfectly clear: THERE IS NOTHING YOU CAN DO TO PREVENT SOMEONE FROM SCRAPING YOUR SITE.
By design of the Tor network's components, it is not possible for the receiver to find out whether the requester is the original source or just a relay.
The behaviour you saw with Google was probably caused by a different security measure. Google detects if a logged-in user changes its IP and presents a captcha just in case, to prevent harmful interception and also to allow the session to continue if an authenticated user really did change their IP (by re-logging on to their ISP, etc.).
I know this is old, but I got here from a Google search, so I figured I'd get to the root concerns in the question here. I develop web applications, but I also do a ton of abusing and exploiting other people's. I'm probably the guy you're trying to keep out.
Detecting Tor traffic really isn't the route you want to go here. You can detect a good number of open proxy servers by parsing request headers, but you've got Tor, high-anonymity proxies, SOCKS proxies, cheap VPNs marketed directly to spammers, botnets and countless other ways to break rate limits.
If your main concern is a DDoS effect, don't worry about it. Real DDoS attacks take either muscle or some vulnerability that puts strain on your server. No matter what type of site you have, you're going to be flooded with hits from spiders as well as bad people scanning for exploits. Just a fact of life. In fact, this kind of logic on the server almost never scales well and can be the single point of failure that leaves you open to a real DDoS attack.
This can also be a single point of failure for your end users (including friendly bots). If a legitimate user or customer gets blocked you've got a customer service nightmare and if the wrong crawler gets blocked you're saying goodbye to your search traffic.
If you really don't want anybody grabbing your data, there are some things you can do. If it's blog content or something, I generally say either don't worry about it or have summary-only RSS feeds if you need feeds at all. The danger with scraped blog content is that it's actually pretty easy to take an exact copy of an article, spam links to it and rank it while knocking the original out of the search results. At the same time, because it's so easy, people aren't going to put effort into targeting specific sites when they can scrape RSS feeds in bulk.
If your site is more of a service with dynamic content, that's a whole other story. I actually scrape a lot of sites like this to "steal" huge amounts of structured proprietary data, but there are options to make it harder. You can limit the requests per IP, but that's easy to get around with proxies. For some real protection, relatively simple obfuscation goes a long way. If you try to do something like scrape Google results or download videos from YouTube, you'll find out there's a lot to reverse engineer. I do both of these, but 99% of people who try fail because they lack the knowledge to do it. They can scrape proxies to get around IP limits, but they're not breaking any encryption.
As an example, as far as I remember a Google result page comes with obfuscated JavaScript that gets injected into the DOM on page load, then some kind of tokens are set so you have to parse them out. Then there's an AJAX request with those tokens that returns obfuscated JS or JSON that's decoded to build the results, and so on and so on. This isn't hard to do on your end as the developer, but the vast majority of potential thieves can't handle it. Most of the ones that can won't put in the effort. I do this to wrap really valuable services like Google's, but for most other services I just move on to lower-hanging fruit at different providers.
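One cheap variation on that theme is requiring a short-lived, server-signed token (delivered with the page and echoed back by the page's JavaScript) before the data endpoint will answer. A hedged sketch, not how Google actually does it; the secret and TTL values are placeholders:

```python
import hashlib
import hmac
import time

SECRET = b"rotate-me-regularly"   # server-side secret; hypothetical value
TOKEN_TTL = 120                   # seconds a token stays valid

def issue_token(session_id: str) -> str:
    """Embedded in the served page; the page's JS must echo it back with the data request."""
    ts = str(int(time.time()))
    sig = hmac.new(SECRET, f"{session_id}:{ts}".encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"

def check_token(session_id: str, token: str) -> bool:
    """Reject data requests whose token is malformed, expired, or forged."""
    try:
        ts, sig = token.split(":")
        age = time.time() - int(ts)
    except ValueError:
        return False
    if age > TOKEN_TTL or age < 0:
        return False
    expected = hmac.new(SECRET, f"{session_id}:{ts}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```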
Hope this is useful for anyone coming across it.
I think the focus on how it is 'impossible' to prevent a determined and technically savvy user from scraping a website is given too much significance. @Drew Noakes states that the website contains information that, when taken in aggregate, has some 'value'. If a website has aggregate data that is readily accessible by unconstrained anonymous users, then yes, preventing scraping may be near 'impossible'.
I would suggest that the problem to be solved is not how to prevent users from scraping the aggregate data, but rather what approaches could be used to remove the aggregate data from public access, thereby eliminating the target of the scrapers without needing to do the 'impossible': prevent scraping.
The aggregate data should be treated like proprietary company information. Proprietary company information in general is not available publicly to anonymous users in an aggregate or raw form. I would argue that the solution to prevent the taking of valuable data is to restrict and constrain access to the data, not to prevent scraping of it when it is presented to the user.
1] User accounts/access – no one should ever have access to all the data within a given time period (data/domain specific). Users should be able to access data that is relevant to them, but clearly, from the question, no user would have a legitimate purpose to query all the aggregate data. Without knowing the specifics of the site, I suspect that a legitimate user may need only some small subset of the data within some time period. Requests that significantly exceed typical user needs should be blocked, or alternatively throttled, so as to make scraping prohibitively time consuming and the scraped data potentially stale.
2] Operations teams often monitor metrics to ensure that large, distributed and complex systems are healthy. Unfortunately, it becomes very difficult to identify the causes of sporadic and intermittent problems, and often it is even difficult to identify that there is a problem as opposed to normal operational fluctuations. Operations teams often deal with statistically analysed historical data taken from numerous metrics, comparing them to current values to help identify significant deviations in system health, be it system uptime, load, CPU utilization, etc.
Similarly, requests from users for data in amounts that are significantly greater than the norm could help identify individuals that are likely to be scraping data; such an approach can even be automated and extended further to look across multiple accounts for patterns that indicate scraping. User 1 scrapes 10%, user 2 scrapes the next 10%, user 3 scrapes the next 10%, etc. Patterns like that (and others) could provide strong indicators of malicious use of the system by a single individual or a group utilizing multiple accounts.
3] Do not make the raw aggregate data directly accessible to end users. Specifics matter here, but simply put, the data should reside on back-end servers and be retrieved using some domain-specific API. Again, I'm assuming that you are not just serving up raw data, but rather responding to user requests for some subsets of the data. For example, if the data you have is detailed population demographics for a particular region, a legitimate end user would be interested in only a subset of that data. For example, an end user may want to know the addresses of households with teenagers that reside with both parents in multi-unit housing, or data on a specific city or county. Such a request would require processing the aggregate data to produce a resultant data set that is of interest to the end user. It would be prohibitively difficult to scrape every resultant data set retrieved from the numerous potential permutations of the input query and reconstruct the aggregate data in its entirety. A scraper would also be constrained by the website's security, taking into account the number of requests per unit of time, the total data size of the resultant data set, and other potential markers. A well-developed API incorporating domain-specific knowledge would be critical in ensuring that the API is comprehensive enough to serve its purpose but not so overly general as to return large raw data dumps. (A rough sketch of such an endpoint follows at the end of this answer.)
The incorporation of user accounts into the site, the establishment of usage baselines for users, the identification and throttling of users (or other mitigation approaches) that deviate significantly from typical usage patterns, and the creation of an interface for requesting processed/digested result sets (vs. raw aggregate data) would create significant complexities for malicious individuals intent on stealing your data. It may be impossible to prevent scraping of website data, but the 'impossibility' is predicated on the aggregate data being readily accessible to the scraper. You can't scrape what you can't see. So unless your aggregate data is raw, unprocessed text (for example library e-books), end users should not have access to the raw aggregate data. Even in the library e-book example, significant deviation from acceptable usage patterns, such as requesting a large number of books in their entirety, should be blocked or throttled.
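As a concrete illustration of point 3 above, here's a hedged Flask sketch of a domain-specific endpoint that refuses unfiltered queries and caps result sizes. The endpoint, field names, sample data and cap are all made up; the point is only the shape of the API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
MAX_RESULTS = 200  # hard cap per response (assumed value)

# stand-in for the real back-end store; in reality this would be a query to the data tier
_DATA = [
    {"city": "Springfield", "region": "north", "teenagers": 2},
    {"city": "Shelbyville", "region": "south", "teenagers": 0},
]

@app.route("/api/households")
def households():
    # require a narrowing filter so nobody can page through the raw aggregate data
    region = request.args.get("region")
    if not region:
        return jsonify(error="a 'region' filter is required"), 400
    try:
        limit = min(int(request.args.get("limit", 50)), MAX_RESULTS)
    except ValueError:
        return jsonify(error="'limit' must be an integer"), 400
    rows = [r for r in _DATA if r["region"] == region][:limit]
    return jsonify(results=rows, truncated=len(rows) == limit)
```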
You can detect Tor users using TorDNSEL - https://www.torproject.org/projects/tordnsel.html.en.
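A rough sketch of a DNSEL-style lookup follows; the zone name and the 127.x response convention reflect the Tor project's documented interface as I understand it, so verify against their current docs before relying on this.

```python
import socket

def is_tor_exit(client_ip: str) -> bool:
    """Best-effort check of an IPv4 address against Tor's DNS exit list."""
    reversed_ip = ".".join(reversed(client_ip.split(".")))
    query = f"{reversed_ip}.dnsel.torproject.org"
    try:
        # known exits are reported as addresses in 127.0.0.0/8 (typically 127.0.0.2)
        return socket.gethostbyname(query).startswith("127.")
    except socket.gaierror:
        return False  # NXDOMAIN (or lookup failure): not a known exit

# usage: if is_tor_exit(request_ip): serve a captcha or degrade gracefully
```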
You can also just use this command-line tool/library - https://github.com/assafmo/IsTorExit.

Firewalls preventing product activation

I'm looking to implement a basic product activation scheme such that when the program is launched, it will contact our server via HTTP to complete the activation. I'm wondering if it is a big problem (especially with bigger companies or educational organizations) that firewalls will block the outgoing HTTP request and prevent activation. Any idea how big an issue this may be?
In my experience, when HTTP traffic is blocked by a hardware firewall, there is more often than not a proxy server which is used to browse the internet. Therefore it is good practice to allow the user to enter proxy and authentication details.
The number of times I have seen applications fail because they don't use the corporate proxy server, and are therefore blocked by the firewall, astonishes me.
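For example, with Python's requests library the activation call could honor user-supplied proxy details roughly like this (the activation URL and credential fields are placeholders):

```python
import requests

def activate(license_key, proxy_host=None, proxy_port=None, proxy_user=None, proxy_pass=None):
    """POST the activation request, optionally through an authenticated corporate proxy."""
    proxies = None
    if proxy_host:
        auth = f"{proxy_user}:{proxy_pass}@" if proxy_user else ""
        proxy_url = f"http://{auth}{proxy_host}:{proxy_port}"
        proxies = {"http": proxy_url, "https": proxy_url}
    resp = requests.post(
        "https://activation.example.com/activate",   # placeholder endpoint
        data={"key": license_key},
        proxies=proxies,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```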
There are also personal software solutions that purposely block outgoing connections. Check out Little Snitch. This program can set up rules that explicitly block your computer from making connections to certain domains, IPs and/or ports. A common use for this program is to stop one's computer from "phoning home" to an activation server.
I can't tell you how prevalent this will be, sorry. But I can give you one data point.
At my company, Internet access is granted on an as-needed basis. There is one product I have had to support which is wonderful for its purpose and reasonably priced, but I will never approve its purchase again: the licensing is too much of a hassle to be worth it.
I'd say that it may not be common, but if any one of your customers is a business, it's likely that you will encounter someone who tries to run your software behind a restricted internet connection or a proxy. Your software will need to handle this situation; otherwise you will have a pissed-off customer who cannot use your product, and you will lose the sale for sure.
If you are looking for a third-party tool, I've used InstallKey (www.lomacons.com) for product activations. It has functionality that allows for validating with and without an internet connection.

Why ww2 subdomains?

I have seen some domain names on the web with a prefix of ww2 or ww3 or so (ww2.somedomain.example, ww3.yourdomain.example). And these appear mostly when traveling from page to page. What would be the reason for having such subdomains? Is there anything special about them, or are they just another subdomain? I mean, are they useful in any particular context?
People running large(-ish) sites used to do this when they needed to break up the load between more than one server. One machine would be called www then the next one would be called www2, etc.
Today, much better load balancing solutions are available that don't require you to expose your internal machine naming conventions to the browser clients.
Technically, the initials before the primary domain name (e.g. the "mail" in mail.yahoo.com) can best be thought of as a machine name, identifying the web server, mail server, or whatever. They can also identify a group of machines (a web farm).
So the person building up that machine can call it anything they want. The initials www are a (somewhat arbitrary) convention.
Oftentimes, ww{x} is used to indicate a particular server in a set of mirrored servers. If properly configured, I could have www.mydomain.example point to my web site on a load balancer, while using ww1, ww2, ww3, etc. to reach the site on a specific load-balanced server.
I can see 3 possibilities:
Making the browser load resources faster: the browser will only open a fixed number of connections to the same domain, so as not to overload the server.
They are using more than one server, so they can share the load between servers.
Separating some content onto a separate virtual host or server. Some kind of organization...
As various answers have pointed out, modern day load-balancers can balance load without having to resort to using different sub-domains for each machine. However, there is still one benefit of dividing your site into various sub-domains: maximize browser connections.
All browsers limit the number of concurrent connections to a particular host (6 for most modern browsers). If a page contains lots of assets, page load would be slow as the browser queues those requests because of the connection limit. By loading different assets from different subdomains, you get around the connection limit, speeding up page load.
Typically it's a partitioning strategy. When sites get sufficiently large that they can't run (or run well) on a single server you then have to look at solutions for scaling the application out horizontally (ie more servers) rather than vertically (ie bigger servers).
Some example partitioning strategies are:
Certain users always use certain servers. This can be arbitrary or based on some criteria (user type, geographic location, etc);
When a user gets a session, that session is assigned to a particular server (sometimes called "sticky sessions", although the same approach can also be used where the differences between machines are transparent); and
Certain activities are always on certain machines.
Another common case is organizational reasons. In an extremely large company, www might be for their main marketing website. And, ww2 might be, say, for product documentation pages.
In an ideal world, all departments would share perfectly. In practice, a big company might have their (www) marketing pages managed by an external agency and their internal (ww2) pages done by their in-house team. Often the marketing agency just doesn't update pages quickly, refuses to run certain stacks, or is too limiting in terms of bureaucratic needs.
The marketing agency may insist on controlling the www and not sharing due to past situations where a company website went down due to internal reasons and yet the agency got blamed, or vice versa.
So, theoretically, there's no need to do this with modern load balancing and such. But in practice, it can be a lot cheaper and more straightforward, and allow better business productivity.

Cross-colo fail-over design, DNS level fail-over?

I'm interested in cross-colo fail-over strategies for web applications, such that if the main site fails users seamlessly land at the fail-over site in another colo.
The application side of things looks to be mostly figured out with a master-slave database setup between the colos and services designed to recover and be able to pick up mid-stream. I'm trying to figure out the strategy for moving traffic from the main site to the fail-over site. DNS failover, even with low TTLs, seems to carry a fair bit of latency.
What strategies would you recommend for quickly moving traffic between colos, assuming the servers at the main colo are unreachable?
If you have other interesting experience / words of wisdom about cross-colo failover I'd love to hear those as well.
DNS based mechanisms are troublesome, even if you put low TTLs in your zone files.
The reason for this is that many applications (e.g. MSIE) maintain their own caches which ignore the TTL. Other software will do a single gethostbyname() or equivalent call and store the result until the program is restarted.
Worse still, many ISPs' recursive DNS servers are known to ignore TTLs below their own preferred minimum and impose their own higher TTLs.
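None of that helps with browsers you don't control, but for API clients you do ship, you can at least avoid pinning one stale address by resolving fresh on every connection attempt and falling back through all returned addresses. A small illustrative sketch, not a substitute for the routing-level fixes discussed below:

```python
import socket

def connect_with_failover(hostname, port, timeout=5):
    """Resolve fresh each time and try every returned address instead of caching one."""
    last_error = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
        hostname, port, type=socket.SOCK_STREAM
    ):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock  # first address that answers wins
        except OSError as exc:
            last_error = exc
    raise last_error or OSError(f"could not connect to {hostname}:{port}")
```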
Ultimately if the site is to run from both data centers without changing its IP address then you need to look at arrangements for "Multihoming" via global BGP4 route announcements.
With multihoming you need to get at least a /24 netblock of "provider independent" (aka "PI") IP address space, and then have that only be announced to the global routing table from the backup site if the main site goes offline.
As for DNS, I like to reference, "Why DNS Based Global Server Load Balancing Doesn't Work". For everything else -- use BGP.
Designing networks to load balance using BGP is still not an easy task, and I myself am certainly not an expert on this. It's also more complex than Wikipedia can tell you, but there are a couple of interesting articles on the web that detail how it can be done:
Load Balancing In BGP Networks
Load Sharing in Single and Multi homed environments
There is always more if you search for BGP and load balancing. There are also a couple of whitepapers on the net which describe how Akamai does their global load balancing (I believe it's BGP too), which is always interesting to read and learn about.
Beyond the obvious concepts you can use software and hardware to achieve, you might also want to check with your ISP/provider/colo if they can set you up.
Also, no offense in regard to your choice of colo (who's the provider?), but most places should be set up to deal with downtime and so on; they should not require you to take action. Of course floods or aliens can always strike, but in that case I guess there are more important issues. :-)
If you can, use Multicast - http://en.wikipedia.org/wiki/Multicast - or Anycast - http://en.wikipedia.org/wiki/Anycast.

Dynamic IP-based blacklisting

Folks, we all know that IP blacklisting doesn't work - spammers can come in through a proxy, and legitimate users might get affected... That said, blacklisting seems to me to be an efficient mechanism to stop a persistent attacker, given that the actual list of IPs is determined dynamically, based on the application's feedback and user behavior.
For example:
- someone trying to brute-force your login screen
- a poorly written bot issues very strange HTTP requests to your site
- a script-kiddie uses a scanner to look for vulnerabilities in your app
I'm wondering if the following mechanism would work, and if so, do you know if there are any tools that do it:
In a web application, the developer has a hook to report an "offense". An offense can be minor (invalid password), and it would take dozens of such offenses to get blacklisted; or it can be major, and a couple of such offenses in a 24-hour period kicks you out.
Some form of web-server-level block kicks in before every page is loaded and determines whether the user comes from a "bad" IP.
There's a "forgiveness" mechanism built-in: offenses no longer count against an IP after a while.
Thanks!
Extra note: it'd be awesome if the solution worked in PHP, but I'd love to hear your thoughts about the approach in general, for any language/platform
Take a look at fail2ban, a Python framework that lets you raise iptables blocks by tailing log files for patterns of errant behaviour.
Are you on a *nix machine? This sort of thing is probably better left to the OS level, using something like iptables.
edit:
In response to the comment: yes (sort of). However, the idea is that iptables can work independently. You can set a certain threshold to throttle (for example, block requests on TCP port 80 that exceed x requests/minute), and that is all handled transparently (i.e., your application really doesn't need to know anything about it for dynamic blocking to take place).
I would suggest the iptables method if you have full control of the box and would prefer to let your firewall handle throttling (the advantages are that you don't need to build this logic into your web app, and it can save resources, as requests are dropped before they hit your web server).
Otherwise, if you expect blocking won't be a huge component (or your app is portable and can't guarantee access to iptables), then it would make more sense to build that logic into your app.
I think it should be a combination of username plus IP block, not just IP.
You're looking at custom lockout code. There are applications in the open-source world that contain various flavors of such code. Perhaps you should look at some of those, although your requirements are pretty trivial: mark an IP/username combo, and use that to block an IP for x amount of time. (Note I said block the IP, not the user. The user may still try to get on via a valid IP/username/password combo.)
Matter of fact, you could even keep traces of user logins, and when a login from an unknown IP hits three strikes on a bad username/password combo, lock that IP out for that username for however long you like. (Do note that a lot of ISPs share IPs, thus...)
You might also want to place a delay in authentication, so that an IP cannot attempt a login more than once every 'y' seconds or so.
I have developed a system for a client which kept track of hits against the web server and dynamically banned IP addresses at the operating system/firewall level for variable periods of time for certain offenses, so, yes, this is definitely possible. As Owen said, firewall rules are a much better place to do this sort of thing than in the web server. (Unfortunately, the client chose to hold a tight copyright on this code, so I am not at liberty to share it.)
I generally work in Perl rather than PHP, but, so long as you have a command-line interface to your firewall rules engine (like, say, /sbin/iptables), you should be able to do this fairly easily from any language which has the ability to execute system commands.
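A minimal Python version of that shell-out approach might look like the sketch below (needs root and assumes the standard iptables command is on the PATH; this is not the poster's actual code):

```python
import subprocess

def ban_ip(ip):
    """Insert a DROP rule for this address at the top of the INPUT chain."""
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)

def unban_ip(ip):
    """Remove the matching DROP rule again (e.g. from a cron job once the ban expires)."""
    subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=True)

# e.g. ban_ip("203.0.113.7")  # documentation-range example address
```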
Err, this sort of system is easy and common; I can give you mine easily enough.
It's simply and briefly explained here: http://www.alandoherty.net/info/webservers/
The scripts as written aren't downloadable (as no commentary has been added yet), but drop me an e-mail from the site above and I'll fling the code at you and gladly help with debugging/tailoring it to your server.
