htaccess to block all bots/crawlers/spiders except ones I allow - .htaccess

I'm looking for an aggressive block via htaccess, not robots.txt
I don't want to list every unfriendly bot under the sun, rather block them all and allow only the ones I want.
After some research I've come up with this, which should block everything I listed, including user agents that don't identify themselves and are blank, a single hyphen, or a space.
Would this work?
#blocked
RewriteCond %{HTTP_USER_AGENT} (bot|robot|crawl|krawler|spider|libwww-perl.*|-?|\ ) [NC]
#allowed
RewriteCond %{HTTP_USER_AGENT} !(googlebot|bingbot|msnbot|yahoo) [NC]
RewriteRule ^ - [F]
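One thing to watch: in the first condition, the alternatives `-?` (zero or one hyphen) and `\ ` (a space anywhere in the string) match almost every user agent, so as written the rule would block nearly all traffic. A tighter sketch, anchoring the blank and "-" cases separately, might be:

```apache
RewriteEngine On
# block known bot keywords (case-insensitive), OR...
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|krawler|spider|libwww-perl) [NC,OR]
# ...a user agent that is empty or exactly "-"
RewriteCond %{HTTP_USER_AGENT} ^-?$
# unless the user agent is one we explicitly allow
RewriteCond %{HTTP_USER_AGENT} !(googlebot|bingbot|msnbot|yahoo) [NC]
RewriteRule ^ - [F]
```

With the OR flag, the first two conditions are grouped, and the allow-list condition is ANDed afterwards, so allowed bots pass even though they contain "bot".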

Related

htaccess block access from a top level domain?

Is there a way through htaccess to block access to images from my site when requested by a specific top level domain, e.g. ".ru"?
I currently use:
RewriteCond %{HTTP_REFERER} ^\.ru [NC,OR]
RewriteRule ^(.*)$ https://www.google.com/images/srpr/logo4w.png [r=307,NC]
but I don't think it's working as intended.
Thanks!
The regular expression that you are using, ^\.ru, means "anything that STARTS with .ru", so if the referer is http://some-site.ru/some-path/some-page.html, it's obviously not going to match. Try:
RewriteCond %{HTTP_REFERER} ^https?://[^/]+\.ru/? [NC]
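Combined with the redirect from the question, a full sketch (keeping the logo4w.png target the asker used) would be:

```apache
RewriteEngine On
# referrer host ends in .ru, e.g. http://some-site.ru/some-path/
RewriteCond %{HTTP_REFERER} ^https?://[^/]+\.ru(/|$) [NC]
# redirect image requests from such referrers to a placeholder image
RewriteRule \.(jpe?g|png|gif)$ https://www.google.com/images/srpr/logo4w.png [R=307,NC,L]
```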

How to write a htaccess rule specific for a given subdomain? - Avoiding indexing some files

I have the following on my .htaccess file:
Options +FollowSymlinks
#+FollowSymLinks must be enabled for any rules to work, this is a security
#requirement of the rewrite engine. Normally it's enabled in the root and we
#shouldn't have to add it, but it doesn't hurt to do so.
RewriteEngine on
#Apache scans all incoming URL requests, checks for matches in our .htaccess file
#and rewrites those matching URLs to whatever we specify.
#allow blank referrers.
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?site.com [NC]
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?site.dev [NC]
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?dev.site.com [NC]
RewriteRule \.(jpg|jpeg|png|gif)$ - [NC,F,L]
# if a directory or a file exists, use it directly
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# otherwise forward it to index.php
RewriteRule . index.php
site.com is the production site.
site.dev is a localhost dev environment.
dev.site.com is a subdomain where we test live.
I'm aware that this will prevent the site from being indexed:
Header set X-Robots-Tag "noindex, nofollow"
cf. http://yoast.com/prevent-site-being-indexed/
My question is however, fairly simple perhaps:
Is there a way to apply this line ONLY on dev.site.com, so that it doesn't get indexed ?
Yes, the cleanest way is to put the Header line in the vhost config for dev.site.com. A bare Header set in an .htaccess file applies to every host that serves that directory, so the Header directive alone can't be scoped to one host.
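If you can't touch the vhost config, one workaround (assuming Apache with mod_setenvif and mod_headers enabled, and the host names from the question) is to set an environment variable when the Host header matches the dev subdomain and condition the header on it:

```apache
# flag requests addressed to the dev subdomain
SetEnvIfNoCase Host ^dev\.site\.com$ DEV_HOST
# send the noindex header only when that flag is set
Header set X-Robots-Tag "noindex, nofollow" env=DEV_HOST
```

Requests for site.com itself never set DEV_HOST, so they are served without the header.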
The other possibility is if you want to block bots via useragent, you can remove the Header set and add some rules:
# request is for http://dev.site.com
RewriteCond %{HTTP_HOST} ^dev\.site\.com$ [NC]
# user-agent is a search engine bot
RewriteCond %{HTTP_USER_AGENT} (Googlebot|yahoo|msnbot) [NC]
# return forbidden
RewriteRule ^ - [L,F]
Note that the list of user agents isn't complete. You can go through a comprehensive list of user agents and pick out all of the indexing robots, or at least the more popular ones.

htaccess and forward one url to another

I have this rewrite code:
RewriteCond %{HTTP_USER_AGENT} ^.*iPhone.*$
RewriteRule ^(.*)$ http://stagingsite.com/site/mobile [R=301]
RewriteRule ^faq/$ /mobile/faq
The first line is working correctly. If the user is on an iPhone, they are redirected to the mobile directory, where the index page is displayed.
I also want users visiting:
http://stagingsite.com/site/faq
to get forwarded to http://stagingsite.com/site/mobile/faq if they're on an iPhone, but the last line of code above doesn't seem to achieve this.
Any ideas what I have wrong?
RewriteCond directives only apply to the immediately following RewriteRule. So you have the condition that checks for iPhone, but it only applies to the redirect rule, not to the faq rule. You have to duplicate it:
RewriteCond %{HTTP_USER_AGENT} ^.*iPhone.*$
RewriteRule ^(.*)$ http://stagingsite.com/site/mobile [R=301,L]
RewriteCond %{HTTP_USER_AGENT} ^.*iPhone.*$
RewriteRule ^faq/?$ /site/mobile/faq [L]
You should also include the L flag so the two rules don't accidentally interfere with each other. And your pattern and target need to be updated to accept an optional trailing slash and to include the subdirectory.
Taking out the slash before mobile, as in:
RewriteRule ^faq/$ mobile/faq
works?

Deny referrals from all domains except one

Is it possible to accept traffic from only one domain, ideally using a .htaccess file?
I want my site to only be accessible via a link on another site I have.
I know how to block one referring domain, but not all domains:
RewriteEngine on
# Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} otherdomain\.com [NC]
RewriteRule .* - [F]
this is my full rewrite code:
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_REFERER} .
RewriteCond %{HTTP_REFERER} !domain\.co.uk [NC]
RewriteRule .? - [F]
# The Friendly URLs part
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
I think it is working, but none of the assets are getting loaded and I get a 500 error when I click on another link.
Make that something like:
RewriteCond %{HTTP_REFERER} .
RewriteCond %{HTTP_REFERER} !yourdomain\.com [NC]
RewriteCond %{HTTP_REFERER} !alloweddomain\.com [NC]
RewriteRule .? - [F]
The first RewriteCond checks that the referrer is not empty. The second checks that it doesn't contain the string yourdomain.com, and the third that it doesn't contain the string alloweddomain.com. If all of these checks pass, the RewriteRule triggers and denies the request.
(Allowing empty referrers is generally a good idea, since browsers can generate them for various reasons, such as when:
the user has bookmarked the link,
the user entered the link manually into the address bar,
the user reloaded the page,
the browser is configured not to send cross-site referrer information, or
a proxy between your site and the browser strips away the referrer information.)
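Putting the referrer checks together with the friendly-URLs block from the question, the whole .htaccess might look like this (domain names are placeholders):

```apache
RewriteEngine On
RewriteBase /
# deny requests whose referrer is set but comes from a non-allowed domain
RewriteCond %{HTTP_REFERER} .
RewriteCond %{HTTP_REFERER} !yourdomain\.com [NC]
RewriteCond %{HTTP_REFERER} !alloweddomain\.com [NC]
RewriteRule .? - [F]
# the friendly URLs part
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
```

Because yourdomain.com is now on the allow list, the site's own pages can load their assets (the referrer for those requests is the page itself), which was the cause of the broken assets reported above.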

.htaccess which somehow prevents Google to put some PageRank on my domains

I have been using the following .htaccess code on all my domains for over two years on some projects, but none of the websites built with it has ever got any Google PageRank, not even a '1' bar. On all websites where I don't use this code, I get a reasonable PageRank.
Could you tell me what I am doing wrong:
RewriteEngine On
RewriteBase /
# rewrite the non 'www' addresses
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
# rewrite REQUEST_URI
RewriteCond %{HTTP_HOST} ^www\.example\.com [OR]
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) index.php [L]
some of my websites using this .htaccess:
http://www.kampril.bg/
http://www.milleniumbg.eu/
Register these domains in Google Search Console and check whether Google reports any errors or other feedback for them. Submit some sitemaps.
If you do not see any error messages or warnings, then it simply means Google does not find the content of your websites interesting.
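Separately, note that the catch-all RewriteRule in the question sends every request, including existing files such as CSS and images, to index.php. The usual front-controller pattern (a sketch, keeping the example.com names from the question) guards the rewrite with file and directory checks:

```apache
RewriteEngine On
RewriteBase /
# canonicalise non-www addresses to www
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
# route only paths that are not real files or directories through index.php
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . index.php [L]
```

This way crawlers fetching stylesheets, images, or robots.txt get the actual files rather than the front page.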
