.htaccess code for blocking bots is not working

I am trying to block some of the bots listed below using .htaccess, and it's not working.
In my PHP code I track hits from unique bots and log the user agent of any bot that gets past the .htaccess block.
That is how I compiled the list below.
I assume that anything blocked by .htaccess will never trigger the PHP script; is that right?
Bots:
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
.htaccess code:
SetEnvIfNoCase User-Agent "SemrushBot" bad_user
Deny from env=bad_user
SetEnvIfNoCase User-Agent "semrush" bad_user
Deny from env=bad_user
SetEnvIfNoCase User-Agent "BLEXBot" bad_user
Deny from env=bad_user
What am I doing wrong here? Why is htaccess not blocking these?
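One possible cause (the thread doesn't say which Apache version is involved): on Apache 2.4 the old Order/Allow/Deny access directives only work through the compatibility module mod_access_compat, and mixing them with 2.4-style configuration can give surprising results. And yes, a request Apache actually denies gets a 403 before PHP ever runs, so the logging assumption holds. A minimal sketch of the 2.4-native equivalent, using the env-var pattern from the Apache access-control documentation:
SetEnvIfNoCase User-Agent "SemrushBot|BLEXBot" bad_user
<RequireAll>
Require all granted
Require not env bad_user
</RequireAll>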

Related

How to block an unusual bot with a name like "bot[\s_ :,\.\;\/\\-]"?

Until today I was blocking unwanted bots in .htaccess:
SetEnvIfNoCase User-Agent .*mj12bot.* bad_bot
SetEnvIfNoCase User-Agent .*baiduspider.* bad_bot
But I recently noticed an unusual bot making a mess on my server, and I don't know how to block it because its name is:
bot[\s_ :,\.\;\/\\-]
I will be grateful for any help.
You can use the following to deny requests from bot[\s_ :,.\;/\-]:
SetEnvIfNoCase user-agent bot\[.+\]|mj12bot|baiduspider bad_bot=1
Order Allow,Deny
Allow from all
Deny from env=bad_bot
To block multiple user-agents, you may use:
SetEnvIfNoCase user-agent bot\[.+\]|.*mj12bot.*|.*baiduspider.* bad_bot=1
Order Allow,Deny
Allow from all
Deny from env=bad_bot
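To check whether the variable is actually being matched, one debugging trick (an aside, assuming mod_headers is available) is to echo it in a response header while testing:
# temporary, for debugging only: the header appears whenever bad_bot matched
Header set X-Bad-Bot "1" env=bad_bot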

htaccess valid or not

I have an .htaccess file where I want to:
- hide the contents of folders -> this one is OK
- redirect when a wrong link is followed -> this one is OK
- solve the validator problem with the Chrome meta tag -> this one I'm not sure about
Here is my .htaccess; is it correct?
Options -Indexes
ErrorDocument 404 /404/index.php
<FilesMatch "\.(htm|html|php)$">
<IfModule mod_headers.c>
BrowserMatch MSIE ie
Header set X-UA-Compatible "IE=Edge,chrome=1" env=ie
</IfModule>
</FilesMatch>
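The header block looks syntactically valid, assuming mod_setenvif and mod_headers are enabled, and serving X-UA-Compatible as an HTTP header instead of a meta tag is indeed the usual way to keep the validator happy. Since only Internet Explorer honors the header anyway, a simpler variant (a sketch, not a confirmed fix) drops the BrowserMatch guard and the FilesMatch wrapper entirely:
<IfModule mod_headers.c>
# non-IE browsers ignore this header, so sending it unconditionally is harmless
Header set X-UA-Compatible "IE=Edge,chrome=1"
</IfModule>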

Block bot/spider via htaccess

I'm trying to block Baiduspider via .htaccess, but it still gets through.
Here's the full user agent of the Baiduspider that doesn't respect robots.txt and isn't turned away by .htaccess:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Here is what I have in robots.txt to try to block him (I know this one is most likely not the real Baiduspider, and the impersonator won't respect robots.txt):
User-agent: Baiduspider
Disallow: /
Here is what I have in .htaccess to deal with him. Is there something incorrect in this, or would someone suggest a better alternative?
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]
RewriteRule .* - [F]
I also tried this in htaccess and it still didn't solve it:
SetEnvIfNoCase user-agent "^Baiduspider" bad_bot
<FilesMatch "(.*)">
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</FilesMatch>
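One detail stands out, judging by the user agent quoted above: both attempts anchor the pattern with ^, but the string starts with "Mozilla/5.0 (compatible; ...", so ^Baiduspider can never match it. A minimal sketch with the anchor dropped:
RewriteEngine On
# match "Baiduspider" anywhere in the user agent, not only at the start
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F]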

SetEnvIf not working on specific URL

I've got a password protected site, and I'm trying to allow a specific URL through so that it works for a Payment callback. The site is built using CakePHP.
The setup below works well; however, the Allow from env=allow is just not being taken into account (I've tried with my own IP address too). The setenvif module is enabled in Apache, and the other "Allow from" lines work fine. FYI, it's running on Ubuntu on EC2. I've also searched the site for similar issues and solutions, but to no avail.
I've checked the $_SERVER global array in PHP for the "allow" environment variable and it exists, so I'm running out of ideas. Any help would be much appreciated!
SetEnvIf Request_URI ^/secure_trading/callback allow=1
SetEnvIf Request_URI ^/secure_trading/callback$ allow=1
SetEnvIf Request_URI "/secure_trading/callback" allow=1
SetEnvIf Request_URI "/app/weboot/secure_trading/callback" allow=1
AuthName "Protected"
AuthGroupFile /dev/null
AuthType Basic
AuthUserFile /var/www/domain.co.uk/.htpasswd
Order deny,allow
Satisfy Any
Deny from all
Allow from 127.0.0.1
Allow from env=allow
require valid-user
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteRule ^$ app/webroot/ [L]
RewriteRule (.*) app/webroot/$1 [L]
</IfModule>
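A plausible explanation (an assumption; the thread doesn't confirm it): the rewrite rules above send every request through app/webroot/, and on an internal rewrite Apache renames environment variables set by SetEnvIf with a REDIRECT_ prefix, so allow may have become REDIRECT_allow by the time the access check runs. A sketch that accepts either name:
SetEnvIf Request_URI "^/secure_trading/callback" allow=1
# after an internal rewrite the variable may be renamed REDIRECT_allow
Allow from env=allow
Allow from env=REDIRECT_allow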

how to ban crawler 360Spider with robots.txt or .htaccess?

I've got a problem with 360Spider: this bot makes too many requests per second to my VPS and slows it down (CPU usage rises to 10-70%, whereas usually it is 1-2%). I looked into the httpd logs and saw lines like these:
182.118.25.209 - - [06/Sep/2012:19:39:08 +0300] "GET /slovar/znachenie-slova/42957-polovity.html HTTP/1.1" 200 96809 "http://www.hrinchenko.com/slovar/znachenie-slova/42957-polovity.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.0.11) Gecko/20070312 Firefox/1.5.0.11; 360Spider
182.118.25.208 - - [06/Sep/2012:19:39:08 +0300] "GET /slovar/znachenie-slova/52614-rospryskaty.html HTTP/1.1" 200 100239 "http://www.hrinchenko.com/slovar/znachenie-slova/52614-rospryskaty.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.0.11) Gecko/20070312 Firefox/1.5.0.11; 360Spider
etc.
How can I block this spider completely via robots.txt? Currently my robots.txt looks like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
User-agent: YoudaoBot
Disallow: /
User-agent: sogou spider
Disallow: /
I've added lines:
User-agent: 360Spider
Disallow: /
but that does not seem to work. How can I block this angry bot?
If you suggest blocking it via .htaccess, note that mine currently looks like this:
# Turn on URL rewriting
RewriteEngine On
# Installation directory
RewriteBase /
SetEnvIfNoCase Referer ^360Spider$ block_them
Deny from env=block_them
# Protect hidden files from being viewed
<Files .*>
Order Deny,Allow
Deny From All
</Files>
# Protect application and system files from being viewed
RewriteRule ^(?:application|modules|system)\b.* index.php/$0 [L]
# Allow any files or directories that exist to be displayed directly
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Rewrite all other URLs to index.php/URL
RewriteRule .* index.php/$0 [PT]
And, despite the presence of
SetEnvIfNoCase Referer ^360Spider$ block_them
Deny from env=block_them
this bot still tries to kill my VPS and still shows up in the access logs.
In your .htaccess file, simply add the following:
RewriteCond %{REMOTE_ADDR} ^(182\.118\.2)
RewriteRule ^.*$ http://182.118.25.209/take_a_hike_moron [R=301,L]
This will catch ALL the bots launched from the 182.118.2xx.xxx range and send them back to themselves...
The crappy 360 bot is being fired from servers in China... so as long as you don't mind saying bye-bye to crappy Chinese traffic from that IP range, this is guaranteed to make those puppies disappear from reaching any files on your web site.
The following two lines in your .htaccess file will also pick it off, simply because it is stupid enough to proudly put 360Spider in its user agent string. This could be handy for when they use IP ranges other than 182.118.2xx.xxx:
RewriteCond %{HTTP_USER_AGENT} .*(360Spider) [NC]
RewriteRule ^.*$ http://182.118.25.209/take_a_hike_moron [R=301,L]
And yes... I hate them too !
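If you'd rather not spend a redirect on them, the same user-agent condition works with a plain 403 (a variant on the answer above, not part of it):
RewriteCond %{HTTP_USER_AGENT} 360Spider [NC]
RewriteRule .* - [F,L]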
Your robots.txt seems right. Some bots simply ignore it (malicious bots can crawl from any IP address, using botnets of hundreds to millions of infected devices around the globe). In that case you can limit the number of requests per second using the mod_security module for Apache 2.x.
Config example here: http://blog.cherouvim.com/simple-dos-protection-with-mod_security/
[EDIT] On linux, iptables also allows restricting tcp:port connections per (x) second(s) per ip, providing conntrack capabilities are enabled on your kernel. See: https://serverfault.com/questions/378357/iptables-dos-limit-for-all-ports
You can put the following rules into your .htaccess file. Note that 360Spider identifies itself in the User-Agent header, not the Referer, so the pattern has to match there:
SetEnvIfNoCase User-Agent 360Spider block_them
Order Allow,Deny
Allow from all
Deny from env=block_them
Note: the Apache module mod_setenvif must be enabled in your server configuration.
The person running the crawler might be ignoring robots.txt. You could block them by IP:
Order Deny,Allow
Deny from 216.86.192.196
or match the user agent in .htaccess:
SetEnvIfNoCase User-agent 360Spider blocked
Deny from env=blocked
I have lines in my .htaccess file like this to block bad bots:
RewriteEngine On
RewriteCond %{ENV:bad} 1
RewriteCond %{REQUEST_URI} !/forbidden.php
RewriteRule (.*) - [R=402,L]
SetEnvIf Remote_Addr "^38\.99\." bad=1
SetEnvIf Remote_Addr "^210\.195\.45\." bad=1
SetEnvIf Remote_Addr "^207\.189\." bad=1
SetEnvIf Remote_Addr "^69\.84\.207\." bad=1
# ...
SetEnvIf Remote_Addr "^221\.204\." bad=1
SetEnvIf User-agent "360Spider" bad=1
It will send the status code 402 Payment Required to all blacklisted IPs / user-agents.
You can put anything that you want displayed to the bot in forbidden.php.
It's quite effective.
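Presumably this pairs with an ErrorDocument directive that isn't shown; the !/forbidden.php exclusion above only makes sense if 402 responses are routed there (an assumption, since the answer doesn't include the line):
# hypothetical companion line: serve forbidden.php as the 402 response body
ErrorDocument 402 /forbidden.php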
I just had to block 360Spider. Solved with StreamCatcher on IIS (IIS7), which fortunately was already installed, so only a small configuration change was needed. Details at http://needs-be.blogspot.com/2013/02/how-to-block-spider360.html
I use the following, and it helps a lot! It checks HTTP_USER_AGENT for bad bots:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/error\.html$
RewriteCond %{HTTP_USER_AGENT} EasouSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} YisouSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} 360Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Sogou\ web\ spider [NC]
RewriteRule ^.*$ - [F,L]
</IfModule>
<IfModule mod_setenvif.c>
SetEnvIfNoCase User-Agent "EasouSpider" bad_bot
SetEnvIfNoCase User-Agent "YisouSpider" bad_bot
SetEnvIfNoCase User-Agent "LinksCrawler" bad_bot
Order Allow,Deny
Allow from All
Deny from env=bad_bot
</IfModule>
