Block all crawlers except Google by htaccess - .htaccess

Currently I'm using this in .htaccess to block a single crawler:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot) [NC]
RewriteRule .* - [R=403,L]
But I want a rule that blocks all crawlers except Googlebot.
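There is no reliable way to recognise every crawler, so any rule of this kind is a heuristic. A minimal sketch, assuming that most crawlers identify themselves with words such as "bot", "crawl", "spider" or "slurp" in the user agent (that word list is an assumption, not an exhaustive registry):

```apache
RewriteEngine On
# Heuristic: treat any user agent containing one of these words as a crawler
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider|slurp) [NC]
# ...but let anything that identifies itself as Googlebot through
RewriteCond %{HTTP_USER_AGENT} !googlebot [NC]
# [F] answers with 403 Forbidden and stops rewriting
RewriteRule ^ - [F]
```

Keep in mind that the user agent is sent by the client and can be spoofed, so this only stops crawlers that identify themselves honestly.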

Related

htaccess redirect subdomain to directory

I need help to write proper rewrite rules in my htaccess files.
I need to redirect something like fr.example.com to example.com/fr, because we recently changed the whole website and the multilingual system is managed differently. The structure and the pages too.
I managed to do that successfully with this piece of code:
RewriteCond %{HTTP_HOST} ^fr\.example\.com [NC]
RewriteRule (.*) http://example.com/fr/$1 [L,R=301]
My problem now is writing more specific rules for individual pages. For example:
fr.example.com/discover/foo should go to example.com/fr/bar/foo (different path, nothing consistent)
BUT example.com/discover/foo should go to example.com/bar/foo (the end of the URL is the same in both English and French)
Right now, since I have some common 301 redirects, the French URLs aren't redirected properly and lead to the English pages. For example, this one:
Redirect 301 /discover/foo /bar/otherfoo
successfully redirects example.com/discover/foo to example.com/bar/otherfoo, but it also redirects fr.example.com/discover/foo there.
How can I write two different rules for english and french? I'll have to write a bunch of different rules since everything is very different from the old subdomain to the new directory, I don't mind.
Thanks !
EDIT
Please note that this is a WordPress installation, and the htaccess starts with:
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
First, these rules:
RewriteCond %{HTTP_HOST} ^fr\.example\.com [NC]
RewriteRule (.*) http://example.com/fr/$1 [L,R=301]
should look like this:
RewriteCond %{HTTP_HOST} ^(www\.)?fr\.example\.com [NC]
RewriteRule (.*) http://example.com/fr/$1 [L,R=301]
in order to capture both www and non-www requests for the subdomain.
Also, this rule:
Redirect 301 /discover/foo /bar/foo
will match requests to both the main domain and the subdomain, because mod_alias's Redirect looks only at the URL path and cannot test the host name. mod_rewrite is the right tool here, so replace that line with:
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com [NC]
RewriteRule ^discover/foo http://example.com/bar/foo [L,R=301]
RewriteCond %{HTTP_HOST} ^(www\.)?(fr)\.example\.com [NC]
RewriteRule ^discover/foo http://example.com/%2/bar/foo [L,R=301]
Note: clear your browser cache before testing, because browsers cache 301 redirects.
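Rule order matters here: mod_rewrite processes rules from top to bottom, so the page-specific redirects have to come before the catch-all subdomain redirect, and both before the WordPress block. A sketch of how the pieces might fit together, assuming a single .htaccess in the document root (the `$` anchors and the /fr/bar/foo target are taken from the question's example and are illustrative only):

```apache
RewriteEngine On
RewriteBase /

# 1. Page-specific redirects first
RewriteCond %{HTTP_HOST} ^(www\.)?fr\.example\.com [NC]
RewriteRule ^discover/foo$ http://example.com/fr/bar/foo [L,R=301]

RewriteCond %{HTTP_HOST} ^(www\.)?example\.com [NC]
RewriteRule ^discover/foo$ http://example.com/bar/foo [L,R=301]

# 2. Catch-all for anything else still requested on the old subdomain
RewriteCond %{HTTP_HOST} ^(www\.)?fr\.example\.com [NC]
RewriteRule (.*) http://example.com/fr/$1 [L,R=301]

# 3. WordPress front controller last
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
```

If the catch-all came first, a request for fr.example.com/discover/foo would already be redirected to example.com/fr/discover/foo before the specific rule had a chance to match.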

htaccess redirect HTTP_USER_AGENT but not the homepage

I need to redirect visitors referred by Google to a welcome-page. The catch is that if Google refers them to the homepage, they should not be redirected to the welcome-page but should land on the homepage itself.
RewriteCond %{HTTP_REFERER} google\.com
RewriteCond %{REQUEST_URI} !http://homepage\.com/
RewriteRule .* http://homepage\.com/welcome-page/
It seems that after the %{HTTP_REFERER} condition, htaccess ignores the negation ("!") and redirects all requests to the welcome-page, including requests for the homepage.
So how can I redirect Google traffic to a specific page, without redirecting when the traffic lands on the homepage?
As you know, Google may bring traffic to different pages on your site: homepage.com, homepage.com/page2, homepage.com/page3, etc. I need homepage.com itself not to be redirected.
With the help of the answer on this page and a little internet searching, this is the answer:
RewriteCond %{HTTP_REFERER} google\.com
RewriteCond %{REQUEST_URI} !^/$ [NC]
RewriteCond %{REQUEST_URI} !/welcome-page/ [NC]
RewriteRule .* /welcome-page/ [R=302,L]
In fact, the homepage should be excluded like this: !^/$
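The reason the original negated condition never excluded anything is that %{REQUEST_URI} contains only the path of the request, never the scheme or host name. An annotated sketch of the same idea (the tighter referer pattern is an assumption about which Google hosts should count):

```apache
RewriteEngine On
# %{REQUEST_URI} is just the path, e.g. "/" or "/page2".
# A pattern such as !http://homepage\.com/ can never match it,
# so that negated condition is always true and everything gets redirected.
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?google\. [NC]
# Leave the homepage alone
RewriteCond %{REQUEST_URI} !^/$
# Leave the welcome page alone, otherwise the redirect loops
RewriteCond %{REQUEST_URI} !^/welcome-page/
RewriteRule ^ /welcome-page/ [R=302,L]
```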
You are not using the correct RewriteCond. To redirect requests referred by Google to /welcome-page/, you can use the following rule:
RewriteEngine on
RewriteCond %{HTTP_REFERER} google\.com [NC]
RewriteRule !welcome-page http://example.com/welcome-page/ [L,R]

redirect users by language - too many redirects

I am trying to redirect users whose browser language is Chinese to a page on the domain, using the following code in an .htaccess file:
RewriteEngine on
RewriteCond %{HTTP:Accept-Language} (zh) [NC]
RewriteRule ^(.*)$ http://www.example.com/under_c.html [L]
When I change my browser language to Chinese and test out the redirect it does go to the specified page, but it doesn't display anything it just gives me an error in the console that says "Failed to load resource: net::ERR_TOO_MANY_REDIRECTS". I've tried other solutions around the web but none of them seem to be able to redirect by language.
Is there a better way to redirect by language in the .htaccess file?
To prevent a rewrite loop you need to exclude under_c.html from the rule.
RewriteEngine on
RewriteCond %{HTTP:Accept-Language} (zh) [NC]
RewriteCond %{REQUEST_URI} !=/under_c.html [NC]
RewriteRule ^ http://www.example.com/under_c.html [R=302,L]
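One thing to watch with the pattern (zh): it matches anywhere in the header, so a visitor whose browser sends something like en-US,en;q=0.9,zh;q=0.1 is redirected as well. A minimal sketch, assuming you only want visitors whose first-listed language is Chinese (the ^zh anchor is my assumption, not part of the rule above):

```apache
RewriteEngine On
# Match only when the header starts with zh (zh, zh-CN, zh-TW, ...)
RewriteCond %{HTTP:Accept-Language} ^zh [NC]
# Skip the target page itself, otherwise the redirect loops
RewriteCond %{REQUEST_URI} !^/under_c\.html$ [NC]
RewriteRule ^ http://www.example.com/under_c.html [R=302,L]
```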

Correctly redirect bot requests to static version of a website

I'm having problems getting my website indexed correctly by Google.
My folder structure looks like this:
root
- cms
- www
example.com points to the root where a .htaccess routes all requests to /www:
RewriteEngine on
RewriteRule ^(.*)$ /www/$1 [L]
Front end
The Angular front end inside /www gets data from /cms via REST api. So far so good.
What I want to achieve is that bots don't crawl inside my ajaxified /www page but instead inside /cms where I print out static contents corresponding to the URL structure in /www.
URL for static content:
/www/test1 -> Outputs nice content via REST
/cms/test1 -> Outputs text-only content for the crawler
Bot redirect
I'm redirecting the bots coming to example.com/www to /cms like this:
RewriteCond %{HTTP_USER_AGENT} (googlebot|yahoo|bingbot|baiduspider) [NC]
RewriteRule ^(.*)$ http://www.example.com/cms/$1 [R=301,L]
Site map
I also registered a sitemap with Google with the following contents:
http://www.example/test1
http://www.example/test2
and so on...
The problem
This all works fine BUT: Google is also crawling the static contents inside /cms directly, without being redirected there by me. I only want this static content to be served through the redirect, not found by Google's bot on its own. That would mean "disallowing" the bot from crawling there, but on the other hand I NEED it to crawl it. A catch-22, in my opinion.
Edit: complete .htaccess file
RewriteEngine On
# Sitemap
RewriteRule ^sitemap(-+([a-zA-Z0-9_-]+))?\.xml(\.gz)?$ /cms/sitemap$1.xml$2 [L]
RewriteRule ^sitemap(-+([a-zA-Z0-9_-]+))?\.html(\.gz)?$ /cms/sitemap$1.xml$2 [L]
# Redirect bots to static pages
RewriteCond %{HTTP_USER_AGENT} (googlebot|yahoo|bingbot|baiduspider) [NC]
RewriteRule ^(.*)$ http://www.example.com/cms/$1 [R=301,L]
# Angular HTML5 mode: Don't rewrite files or directories
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !index
# Angular HTML5 mode: Rewrite everything else to index.html to allow html5 state links
RewriteRule (.*) /www/index.html [L]
Edit 2
I have added this tag to the www page
<meta name="fragment" content="!">
to let the crawler know there's AJAX being used on the page. And I'm using the rewrite suggested by @Croises, but in reaction to Google's _escaped_fragment_ re-request. Let's wait a few days...
RewriteCond %{HTTP_USER_AGENT} (googlebot|yahoo|bingbot|baiduspider) [NC]
RewriteCond %{QUERY_STRING} _escaped_fragment_
RewriteCond %{REQUEST_URI} !^/cms/
RewriteRule ^(.*)$ cms/$1 [L]
You can't redirect bots to a static page and at the same time ask them to index the original URL: after a redirect, they crawl and index the page they end up on.
Instead, you can rewrite the request internally:
# Rewrite bots to static pages
RewriteCond %{HTTP_USER_AGENT} (googlebot|yahoo|bingbot|baiduspider) [NC]
RewriteCond %{REQUEST_URI} !^/cms/
RewriteRule ^(.*)$ cms/$1 [L]
This is the same rule, just without R=301. That way the static page is served under the original URL, without a redirect.
But beware of cloaking (Google and Cloaking).

.htaccess which somehow prevents Google to put some PageRank on my domains

I have been using the following .htaccess code on all my domains for more than two years, but none of the websites built with it has ever received any Google PageRank, not even one bar. On all the websites where I don't use this code, I get a reasonable PageRank.
Could you tell me what I am doing wrong:
RewriteEngine On
RewriteBase /
# rewrite the non 'www' addresses
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
# rewrite REQUEST_URI
RewriteCond %{HTTP_HOST} ^www\.example\.com [OR]
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) index.php [L]
Some of my websites using this .htaccess:
http://www.kampril.bg/
http://www.milleniumbg.eu/
Register these domains in the Google Search Console and check whether Google returns some error messages or some feedback about these. Submit some sitemaps.
If you do not see any error messages or warnings, then it simply means Google does not find the content of your websites interesting.
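One thing that may be worth checking while you look at Search Console: as written, the second rule sends every request on those hosts to index.php, including robots.txt, sitemap files, images and stylesheets, unless index.php serves those itself. A sketch of the more common front-controller form, assuming the static files exist on disk and should be served directly (whether this has any effect on ranking is not something I can confirm):

```apache
RewriteEngine On
RewriteBase /

# Send non-www requests to the www host
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# Route everything that is not an existing file or directory to index.php
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ index.php [L]
```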

Resources