How to keep bots and spiders from consuming my server's resources? - .htaccess

My PHP script counts unique visitors, but the count was absurd compared to Google Analytics: my script reported 30,000 a day while Analytics counted 2,000, and 2,000 is the correct number. So I added a condition to my script to avoid counting bots and spiders (a simplified sketch of that kind of check follows the log sample below).
I also made the script identify the bots; in little more than a minute I had logged over 100 of them. Memory is limited and the bots are consuming resources, which I want to avoid. My robots.txt:
# Allow Google, Yahoo and Bing to crawl everything except /admin/, /analitics/ and /class/
User-agent: Googlebot
User-agent: Yahoo! Slurp
User-agent: msnbot
Disallow: /admin/
Disallow: /analitics/
Disallow: /class/
Allow: /
# Disallow all other bots from crawling anything
User-agent: *
Disallow: /
Is there a way to prevent this many requests? I don't mind Google's or Bing's crawlers, but this is ridiculous. A sample:
es ip:40.77.167.161 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.140 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.177 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.191 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
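For reference, the kind of condition I mean is a simple user-agent check (a simplified sketch; the substring list here is only an example, not the full list):
<?php
// Skip the visitor counter when the user-agent looks like a bot or spider.
// The substrings below are only examples of the kind of patterns matched.
function looksLikeBot(string $userAgent): bool
{
    $patterns = ['bot', 'spider', 'crawl', 'slurp', 'bingpreview'];
    $userAgent = strtolower($userAgent);
    foreach ($patterns as $p) {
        if (strpos($userAgent, $p) !== false) {
            return true;
        }
    }
    return false;
}

if (!looksLikeBot($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    // ...count the unique visitor as before...
}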

First, we have to distinguish between bots that announce themselves correctly and bots that claim to be somebody else. Many bots pose as other bots for various reasons (not only bad ones). Among the bots that announce themselves as somebody else, there is a subgroup that also ignores robots.txt.
In your case, we seem to have an actual bot from Microsoft.
Bots obeying robots.txt
The IPs do in fact seem to be genuine Bing addresses. Microsoft explains that to find out whether an IP address actually belongs to them, we should use nslookup.
This gives us:
$ nslookup 40.77.167.87
87.167.77.40.in-addr.arpa name = msnbot-40-77-167-87.search.msn.com.
$ nslookup msnbot-40-77-167-87.search.msn.com
[...]
Non-authoritative answer:
Name: msnbot-40-77-167-87.search.msn.com
Address: 40.77.167.87
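The same check can be automated server-side. A minimal PHP sketch (the function name and the accepted host-name suffixes are my own assumptions, not something Microsoft prescribes): reverse-resolve the IP, check the suffix, then forward-resolve the name and confirm it maps back to the same IP.
<?php
// Verify that an IP claiming to be Bingbot (or Googlebot) really belongs to
// the search engine: reverse DNS, suffix check, then forward DNS confirmation.
function isVerifiedSearchBot(string $ip, array $suffixes = ['.search.msn.com', '.googlebot.com']): bool
{
    $host = gethostbyaddr($ip);            // reverse DNS lookup
    if ($host === false || $host === $ip) {
        return false;                      // no usable PTR record
    }

    $suffixOk = false;
    foreach ($suffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            $suffixOk = true;
            break;
        }
    }
    if (!$suffixOk) {
        return false;                      // PTR points somewhere else entirely
    }

    // Forward confirmation: the host name must resolve back to the original IP.
    return gethostbyname($host) === $ip;
}

var_dump(isVerifiedSearchBot('40.77.167.87'));   // expected: bool(true)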
Microsoft does not give a clear example of which name their bot listens for in the robots.txt file, but I think they want to see bingbot rather than msnbot.
Note
Bingbot, upon finding a specific set of instructions for itself, will ignore the directives listed in the generic section, so you will need to repeat all of the general directives in addition to the specific directives you created for them in their own section of the file.
[...]
Bots are referenced as user-agents in the robots.txt file. [...]
(emphasis mine)
On the other hand, Microsoft links to the Robots Database for a list of valid robot names, and that list does not include Bingbot at all (only msnbot, which you already use).
Still, I would try to add bingbot to your user-agents in the robots.txt file and see if that helps.
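For example (a sketch of the idea; note that, per the quote above, the bingbot group has to repeat every directive, because it will ignore the generic section once it has a section of its own):
# Add a group that bingbot will match; the rest of the file stays unchanged.
User-agent: bingbot
User-agent: msnbot
Disallow: /admin/
Disallow: /analitics/
Disallow: /class/
Allow: /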
You did not include the actual path requested in each request. There seem to be situations where Microsoft's Bingbot cannot be blocked with robots.txt alone.
Bots not obeying robots.txt
For bots that do not obey robots.txt, you can only use server-side detection. You could either block them based on their IP (if you see the requests coming from the same IP all the time) or on their user-agent (if they always announce the same user-agent).
For example, some websites block the user-agent scrapy on the server side (returning an empty page, a 404, or similar), because it is the default user-agent of a popular web-scraping framework. A sketch of such a rule follows.
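A minimal .htaccess sketch of such a block, assuming Apache 2.2-style access directives (the matched substring is just this one example):
# Refuse anything announcing itself as Scrapy (case-insensitive match).
SetEnvIfNoCase User-Agent "scrapy" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot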
You can also implement IP-based blocking automatically: for example, if you see more than k requests within x hours, block that IP for the next 10*x hours. This can of course lead to false positives if the IP belongs to a consumer ISP, because ISPs often assign the same IP address to different users over time, which means you might block normal users. However, with about 2,000 visitors per day, the risk that two of your visitors share an IP address that gets blocked for making too many requests is low.
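A sketch of that kind of automatic blocking in PHP, using a small SQLite file as the counter store (the thresholds, table names and file path are all assumptions to adapt):
<?php
// More than $maxRequests requests within $windowHours hours gets the IP
// blocked for 10 * $windowHours hours.
$maxRequests = 300;            // "k" in the text above, pick your own value
$windowHours = 1;              // "x" in the text above
$db = new PDO('sqlite:' . __DIR__ . '/ratelimit.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS hits (ip TEXT, ts INTEGER)');
$db->exec('CREATE TABLE IF NOT EXISTS blocks (ip TEXT PRIMARY KEY, until INTEGER)');

$ip  = $_SERVER['REMOTE_ADDR'];
$now = time();

// Already blocked?
$stmt = $db->prepare('SELECT until FROM blocks WHERE ip = ? AND until > ?');
$stmt->execute([$ip, $now]);
if ($stmt->fetchColumn() !== false) {
    http_response_code(429);
    exit;
}

// Record this request and count the recent ones from the same IP.
$db->prepare('INSERT INTO hits (ip, ts) VALUES (?, ?)')->execute([$ip, $now]);
$stmt = $db->prepare('SELECT COUNT(*) FROM hits WHERE ip = ? AND ts > ?');
$stmt->execute([$ip, $now - $windowHours * 3600]);

if ((int) $stmt->fetchColumn() > $maxRequests) {
    $db->prepare('INSERT OR REPLACE INTO blocks (ip, until) VALUES (?, ?)')
       ->execute([$ip, $now + 10 * $windowHours * 3600]);
    http_response_code(429);
    exit;
}
// (Old rows in "hits" should be pruned from time to time.)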

Related

Some websites return a forbidden response only in Firefox on Linux (changing the user agent to Chrome works though); common cause?

In the past few years I've sometimes run into websites that don't work in Firefox on Linux, and I'm trying to understand why, so I can notify the owners with more than just a vague “it doesn't work”.
Now, this happens, of course. While most web developers do test in Firefox, not many will have tested their products in Firefox on Linux, and some really don't care. Some only target Chrome/WebKit and don't bother with Firefox at all. That is not what this question is about, though.
Something here makes me suspect an underlying cause that repeats across seemingly unrelated websites: some shared bit of configuration, content-serving library, or application. Something is fishy.
The problem
The websites affected return only a plain HTML message with a 403 HTTP status code for any resource requested; it looks like this:
Forbidden
You don't have permission to access / on this server.
These websites do work when:
The operating system is not a Linux distribution
or
The browser is not Firefox
Example websites
While I normally wouldn't include a link to someone else's website, in this case I do because it is the website of a doctor's office. Such websites should be available to any patient at all times, for anything short of an imminently life-threatening emergency (in which case the national emergency number should be called, of course), to provide contact information in times of need.
This website displays the symptoms described above: https://www.huisartsenpraktijkdehaan.nl/
There are more websites, but the pattern is always the same.
The user-agent string
Figuring out what is actually causing this seems simple enough, though. If I change the user-agent string to that of Chrome, it works.
So my tentative conclusion is that this is purely a user-agent-driven bug/feature.
Some further testing yields this:
These work
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36
Foo
Mozilla/5.0 (X11; Ubuntu; x86_64; rv:85.0) Gecko/20100101 Firefox/85.0
Mozilla/5.0 (X11; Ubuntu; inux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0
Mozilla/5.0 (X11; Ubuntu; Linu x86_64; rv:85.0) Gecko/20100101 Firefox/85.0
Mozilla/5.0 (X11; Xubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0
X11; Ubuntu; Linu
X11;Ubuntu;Linux
11; Ubuntu; Linux
These do not work
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0
X11; Ubuntu; Linux
x11; ubuntu; linux
Hypothesis
Having the literal string X11; Ubuntu; Linux (case insensitive but including spaces and semi-colons as-is) in the user-agent HTTP header of your request triggers the broken behaviour.
The conundrum
I could, of course, reach out to the owners of these websites (and eventually I will), but there is a catch. They likely won't use Firefox on Linux (because you would notice your own website being broken), and if they pass on the message to whoever maintains or built the website, the response may very well be “well, it works for you, and it works for me, that user must have some weird virus-ridden computer and an ancient browser with a Bonzi Buddy toolbar”, or something similar.
So I want some more ammunition, and preferably a cause I can explain to anyone with a website like this. Even better would be to find out why this happens, and fix it at the source.
So what is happening here? Some Apache or Nginx module/config/plugin written by someone who really hates people who use Firefox on Linux? Some weird bug repeated on multiple sites?
Does anyone recognize this peculiar website behaviour?
I saw the link to this post in forwarded emails.
ErrorDocument 503 "Your connection was refused"
SetEnvIfNoCase User-Agent "X11; Ubuntu; Linux" bad_user
Deny from env=bad_user
This was set in the .htaccess to block "WP-login bots"; since I use Ubuntu myself, it was rather easy to replicate.
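If the goal is just to keep bots away from a WordPress login URL, a narrower rule avoids this collateral damage, for example (a sketch using the same Apache 2.2-style directives as the snippet above; the whitelisted IP is a placeholder):
# Only protect the login script instead of matching a user-agent substring
# that legitimate Firefox-on-Ubuntu visitors also send.
<Files "wp-login.php">
    Order Deny,Allow
    Deny from all
    # Allow from 203.0.113.42   # optionally whitelist your own IP
</Files>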

How does the browser know which headers to add to requests

When I type the URL of a site into the browser's address bar, the browser sends a request to fetch the resource at that URL. But when I go to different websites (google.com, amazon.com, etc.), the requests that initialize the page have different headers for different sites.
Where does the browser get the set of request headers to load the page, if at the first request all it has is the URL of the resource?
For example, when I go to google.com, the browser sends these request headers:
:authority: www.google.com
:method: GET
:path: /
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7
cache-control: max-age=0
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: same-origin
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36
For amazon.com, the request headers are different:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7
Connection: keep-alive
Host: amazon.com
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36
When you type a URL into the address bar, it needs to be translated into an HTTP request.
So typing www.google.com means you need to GET the default page (/) from that server. That's basically all covered in the first four lines of the first request.
The browser also knows what formats it can accept. Mostly servers deliver HTML back, so text/html is certainly in there, but other formats are accepted too - including the completely generic */* btw! :-)
Responses are often compressed (with either gzip, deflate or the newer brotli (br) format), so the browser tells the server which of those it supports in the accept-encoding header.
When you installed your browser you also set a default language, so the browser can tell the server that too. Some servers will return different content based on this.
Then there are some security headers (I won't go into these, as they are quite complicated).
Finally we have the user-agent header. This is basically where the browser tells the server whether it's Chrome, or Firefox, or whatever. But for historical reasons it's much longer than just "Chrome".
So basically the request headers are things the browser sends to the server to give it more information about the browser and its capabilities. For a request that's just typed into the browser, the request headers will basically be the same no matter what the URL is. For additional requests made by the page - e.g. by JavaScript code - they may be different if the code adds more headers.
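You can see that these are simply client-side choices by building a request yourself. A PHP cURL sketch (the URL and header values are arbitrary examples):
<?php
// Build a request "by hand": Accept, Accept-Language, Accept-Encoding and
// User-Agent are all values the client decides to send, just like a browser.
$ch = curl_init('https://www.example.com/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => [
        'Accept: text/html,application/xhtml+xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.9',
    ],
    CURLOPT_ENCODING       => '',   // advertise every encoding cURL supports (gzip, deflate, ...)
    CURLOPT_USERAGENT      => 'MyClient/1.0 (example)',
]);
$body = curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HTTP_CODE), "\n";
curl_close($ch);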
As to the differences between the two example requests you gave:
Google uses HTTP/2 (or QUIC if using Chrome, but for now that's basically HTTP/2 as far as this question is concerned). You can see this if you add the optional Protocol column in developer tools.
HTTP/2 has a couple of changes from HTTP/1, namely:
HTTP header names are lowercased. Technically in HTTP/1 they are case-insensitive, but by convention many tools, like browsers, used title case (capitalising the first letter of each word).
The request line (e.g. GET / HTTP/1.1) is converted to pseudo-headers beginning with a colon (:method: GET, :path: /, ...etc.).
Host is basically :authority in HTTP/2.
:scheme is basically new in HTTP/2; previously it wasn't explicitly part of the HTTP request and was handled at the connection level.
Connection is defunct in HTTP/2. Even in HTTP/1.1 it defaulted to keep-alive, so the header above was not necessary, but lots of browsers and other clients sent it for historical reasons.
I think that explains all the differences.
So how does the browser know whether to use HTTP/2 or HTTP/1.1? That already has an answer on Stack Overflow, but basically it's decided when the HTTPS session is established: the server advertises that it supports HTTP/2 and the browser chooses to use it.

Google's Bot Killing my Bandwidth

I am facing a strange and very serious problem with my websites hosted on a HostGator reseller account. Since March 23, 2018, my sites have been accessed by Google's bot (user agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)). The IP addresses are in the 66.249 range and change frequently. From my cPanel I can see entries like the ones below:
66.249.79.79 /MzhmLzUxNzE5LzhmLzE2NjYvZmgz.asp 4/8/18, 5:30 AM 7377 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.79.75 /bXYtMzE2MS92b2svNzczODEtb2t3bw== 4/8/18, 5:29 AM 7377 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.79.75 /cTN1Lzc0ODQ3LzN1LzY0MzAvdXFi.asp 4/8/18, 5:29 AM 7377 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.79.75 /cGEwMGk2LzczODMvYTAvODU4MDEvMA== 4/8/18, 5:29 AM 7377 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.79.77 /eDItOTQ4NC8yYWIvMzMwNTctYWJtNA== 4/8/18, 5:29 AM 7377 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.79.79 /ZmlhLzc4NTk5L2lhLzMyNzcvYTVo.asp 4/8/18, 5:29 AM 7377 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
and there are tons of these. They are eating up my bandwidth, and I am getting no help from the HostGator support team, as they don't have any specific solution to this either.
Therefore, first I would like to know whether there is any option or walkthrough to temporarily stop Google from accessing my site. Secondly, can I do something to clear the list of URLs Google has indexed for my website?
Well-behaved crawlers will observe a Crawl-delay directive (a delay in seconds between requests) that you can place in a robots.txt file in the root of your domain; see the section called "Nonstandard extensions" at https://en.m.wikipedia.org/wiki/Robots_exclusion_standard. Be aware, though, that Googlebot itself ignores Crawl-delay; for Google, the crawl rate is adjusted through Search Console instead.
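For crawlers that do honour it (Bing, for example), the directive looks like this (a sketch, with an arbitrary 10-second delay):
User-agent: bingbot
Crawl-delay: 10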
You can also use Google's Search Console (formerly called Webmaster Tools, I think) to request that URLs be removed from their indexes, control other behaviour, and see any issues with indexing your site. You will need to register and then complete a verification process to link your account to your domain: https://www.google.com/webmasters/tools/home?hl=en

Blocking Aggressive/Incoherent Bot

I have a weird bot pummeling my site. It COULD be some sort of low-level denial-of-service attack, but I think that's unlikely. I'm looking for suggestions on blocking it because it's rapidly chewing through all of my CPU and bandwidth allotments.
Here's what it does:
Roughly 650 page requests per minute, like clockwork, constantly, for weeks
Large list of IPs -- hundreds, rotating, with Geolocations randomly scattered all around the world
Rotating user agent strings, many of which are for legit browsers
HTTP_REFERER is often, but not always, filled with a spam site
And weirdest of all, the GET requests almost always generate 404 errors because most are for fully-qualified URLs which are NOT MY SITE. When they are not full URLs, they are for pages or resources that don't exist, never have, and don't even appear to be exploit attempts.
Here are some sample records from my server logs:
80.84.53.26 - - [24/Feb/2015:06:15:43 -0600] "GET http://www.proxy-listen.de/azenv.php HTTP/1.1" 404 - "http://www.google.co.uk/search?q=HTTP_HOST" "Opera/9.20 (Windows NT 6.0; U; en)"
54.147.200.126 - - [24/Feb/2015:06:15:44 -0600] "GET http://www.pinterest.com/jadajuicy07/ HTTP/1.1" 404 - "-" "Mozilla/4.0 (compatible; Ubuntu; MSIE 9.0; Trident/5.0; zh-CN)"
91.121.161.167 - - [24/Feb/2015:06:15:44 -0600] "GET http://78.37.100.242/search?tbo=d&filter=0&nfpr=1&source=hp&num=100&btnG=Search&q=%221%22+%2b+intitle%3a%22contact%22+%7efossil HTTP/1.1" 404 - "http://78.37.100.242/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
185.2.101.78 - - [24/Feb/2015:06:15:43 -0600] "GET http://mail.yahoo.com/ HTTP/1.1" 200 269726 "-" "Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.21022; .NET CLR 3.5.30729; MS-RTC LM 8; .NET CLR 3.0.30729)"
142.0.140.68 - - [24/Feb/2015:06:15:44 -0600] "GET http://ib.adnxs.com/ttj?id=4311122&cb=[CACHEBUSTER]&referrer=[REFERRER_URL] HTTP/1.0" 404 - "http://www.monetaryback.com/?p=1419" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/4.0.206.1 Safari/532.0"
This is the third time I've dealt with these same conditions. It last happened about six months ago. For reference, my site is a blog about baseball (on a blogging platform I built myself) with a few hundred regular visitors. I'm in the US, but my site contains no state secrets!
For now I've redirected all 404 errors to a script which dynamically modifies my .htaccess file to instantly ban IPs that make incoherent requests. That works, but I don't think it's sustainable.
What is this thing? And what's the best practice method of blocking it? Thanks.

IE9 user-agent issue on redirect

I am using user-agent validation on the session: if the user-agent changes, we delete the session.
But I am facing a problem with IE9 and the Google OAuth redirect.
When IE9 hits our site, it sends a valid IE9 user-agent.
So the user-agent is
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
but after the redirect the user-agent becomes
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET CLR 1.1.4322; .NET4.0C; .NET4.0E)
so my session-validation logic fails in this case.
Is there any way to force IE9 to fall back to the proper IE9 user-agent,
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Adding a user-agent check doesn't make your session more secure. There is no scenario in which an attacker has a session ID but no user-agent. Your security scheme is equivalent to this: http://domain/?is_hacker=No. If you want to make your session more secure, you should enable the cookie security flags and remove this pointless check.
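For reference, in PHP 7.3+ the cookie flags mentioned above can be set like this (a sketch; adjust lifetime, path and SameSite to your application):
<?php
// Harden the session cookie instead of relying on a user-agent check.
session_set_cookie_params([
    'lifetime' => 0,        // session cookie, expires when the browser closes
    'path'     => '/',
    'domain'   => '',       // current host only
    'secure'   => true,     // only send the cookie over HTTPS
    'httponly' => true,     // not readable from JavaScript
    'samesite' => 'Lax',    // limits cross-site sending of the cookie
]);
session_start();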
