Will this robots.txt only allow googlebot to index my site? - .htaccess

Will this robots.txt file only allow googlebot to index my site's index.php file? CAVEAT, I have an htaccess redirect that people who type in
http://www.example.com/index.php
are redirected to simply
http://www.example.com/
So, this is my robots.txt file content...
User-agent: Googlebot
Allow: /index.php
Disallow: /
User-agent: *
Disallow: /
Thanks in advance!

Not really.
Good bots
Only "good" bots follow the robots.txt instructions (not all robots and spiders bother to read/follow robots.txt). That might not even include all the main search engine's bots, but it definitely mean that some web crawlers will just completely ignore your requests (you should look at using .htaccess or password protection if you really want to stop bots/crawlers from seeing parts of your site).
Second checks
Google makes multiple visits to your website, including appearing as a browsing user. This second visit will ignore the robots.txt file. The second visit probably doesn't actually index (if that's your worry) but it does check to make sure you're not trying to fool the indexing bot (for SEO etc).
That being said your syntax is right... if that's all you're asking, then yes it'll work, just not as well as you might hope.

Absent the redirect, Googlebot would not see your site, except for the index.php.
With the redirect, it depends on how the bot handles redirects and how your htaccess does the redirect. If you return a 302, then Googlebot will see http://www.example.com/, check against robots.txt, and not see the main site. Even if you do an internal redirect and tell Googlebot that the responding page is http://www.example.com/, it will see the page but might not index it.

It's risky. To be sure that Google does index your homepage make this:
User-agent: *
Allow: /index.php
Disallow: /a
Disallow: /b
...
Disallow: /z
Disallow: /0
...
Disallow: /9
So your root "/" will not match disallow rules.
Also if you have AdSense don't forget to add
User-agent: Mediapartners-Google
Allow: /

Related

How to disallow specific pages in robots.txt, but allow everything else?

Is this the way to do it?
User-agent: *
Allow: /
Disallow: /a/*
I have pages like:
mydomaink.com/a/123/group/4
mydomaink.com/a/xyz/network/google/group/1
I don't want to allow them to appear on Google.
Your robots.txt looks correct. You can test in in your Google's Webmaster Tools account if you want to be 100% sure.
FYI, blocking pages in robots.txt doe snot guarantee they will not show up in the search results. It only prevents search engines from crawling those pages. They can still list them if they want to. To prevent a page from being indexed and listed you need to use the x-robots-tag HTTP header.
If you use Apache you can place a file in your /a/ directory with the following line to effectively block those pages:
<IfModule mod_headers.c>
Header set X-Robots-Tag: "noindex"
</IfModule>

How can i fix "Googlebot can't access your site" issue?

I just keep getting a message about
"Over the last 24 hours, Googlebot encountered 1 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%.
You can see more details about these errors in Webmaster Tools. "
I searched it and told me to add robots.txt on my site
And when I test the robots.txt on Google webmaster tools ,the robots.txt just cannot be fetched.
I thought maybe robots.txt is blocked by my site ,but when I test it says allowed by GWT.
'http://momentcamofficial.com/robots.txt'
And here is the content of the robots.txt :
User-agent: *
Disallow:
So why the robots.txt cannot be fetched by Google?What did I miss .... Can anybody help me ???
I had a situation where Google Bot wasn't fetching yet I could see a valid robots.txt in my browser.
The problem turned out that I was redirecting my whole site (including robots.txt ) to https, and Google didn't seem to like that. So I excluded robots.txt from the redirect.
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteCond %{REQUEST_FILENAME} !robots\.txt
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
More info on my blog
Before Googlebot crawls your site, it accesses your robots.txt file to
determine if your site is blocking Google from crawling any pages or
URLs. If your robots.txt file exists but is unreachable (in other
words, if it doesn’t return a 200 or 404 HTTP status code), we’ll
postpone our crawl rather than risk crawling URLs that you do not want
crawled. When this happens, Googlebot will return to your site and
crawl it as soon as we can successfully access your robots.txt file.
As you know having robots.txt is optional so you don't need to make one, just make sure your host would send 200 or 404 http status only.
You have the wrong content in your robots.txt file, change it to:
User-agent: *
Allow: /
And make sure that everybody has the permissions to read the file.
I was getting this error when "yandex" crawled the site and also with some website checkers. After checking everything multiple times, I made a copy of robots.txt and called it robot.txt. Now "yandex" and the tool, both work.

Having problems understanding how to block some URLs on robot.txt

The problem is this. I have some URLs on the system I have that have this pattern
http://foo-editable.mydomain.com/menu1/option2
http://bar-editable.mydomain.com/menu3/option1
I would like to indicate in the robot.txt file that they should not be crawled. However, I'm not sure if this pattern is correct:
User-agent: Googlebot
Disallow: -editable.mydomain.com/*
Will it work as I expect?
You can't specify a domain or subdomain from within a robots.txt file. A given robots.txt file only applies to the subdomain it was loaded from. The only way to block some subdomains and not others is to deliver a different robots.txt file for the different subdomains.
For example, in the file http://foo-editable.mydomain.com/robots.txt
you would have:
User-agent: Googlebot
Disallow: /
And in http://www.mydomain.com/robots.txt
you could have:
User-agent: *
Allow: /
(or you could just not have a robots.txt file on the www subdomain at all)
If your configuration will not allow you to deliver different robots.txt files for different subdomains, you might look into alternatives like robots meta tags or the X-robots-tag response header.
I think you have to code it like this.
User-agent: googlebot
Disallow: /*-editable.mydomain.com/
There's no guarantee that any bot will process the asterisk as a wild card, but I think the googlebot does.

No Robots robots.txt Location

A bit confused with robots.txt.
Say I wanted to block robots on a site on a Linux based Apache server in location:
var/www/mySite
I would place robots.txt in that directory (alongside index.php) containing this:
User-agent: *
Disallow: /
right?
Does that stop robots indexing the whole server or just the site in var/www/mySite? For example would the site in var/www/myOtherSite also have robots blocked? Because I just want to do it for the one site.
Thanks!
Robots (well-behaved robots, that is -- honouring robots.txt is entirely voluntary) will use the robots.txt found in the root of your domain. If mySite is served off mysite.com and myOtherSite is served off myothersite.com, then your robots.txt would only be served on mysite.com and this works as intended.
To test, just head to http://myothersite.com/robots.txt and verify that you get a 404.

How do I tell search engines not to index content via secondary domain names?

I have a website at a.com (for example). I also have a couple of other domain names which I am not using for anything: b.com and c.com. They currently forward to a.com. I have noticed that Google is indexing content from my site using b.com/stuff and c.com/stuff, not just a.com/stuff. What is the proper way to tell Google to only index content via a.com, not b.com and c.com?
It seems as if a 301 redirect via htaccess is the best solution, but I am not sure how to do that. There is only the one htaccess file (each domain does not have its own htaccess file).
b.com and c.com are not meant to be aliases of a.com, they are just other domain names I am reserving for possible future projects.
robots.txt is the way to tell spiders what to crawl and what to not crawl. If you put the following in the root of your site at /robots.txt:
User-agent: *
Disallow: /
A well-behaved spider will not search any part of your site. Most large sites have a robots.txt, like google
User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /news
#and so on ...
You can simply create a redirect with a .htaccess file like this:
RewriteEngine on
RewriteCond %{HTTP_HOST} \.b\.com$ [OR]
RewriteCond %{HTTP_HOST} \.c\.com$
RewriteRule ^(.*)$ http://a.com/$1 [R=301,L]
It pretty much depends of what you want to achieve. 301 will say that the content is moved permanently (and it is the proper way of transferring PR), is this what you want to achieve?
You want Google to behave? Than you may use robots.txt, but keep in mind there is a downside: this file is readable from outside and every time located in the same place, so you basically give away the location of directories and files that you may want to protect. So use robots.txt only if there is nothing worth protecting.
If there is something worth protecting than you should password protect the directory, this would be the proper way. Google will not index password protected directories.
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=93708
For the last method it depends if you want to use the httpd.conf file or .htaccess. The best way will be to use httpd.conf, even if .htaccess seems easier.
http://httpd.apache.org/docs/2.0/howto/auth.html
Have your server side code generate a canonical reference that point to the page to be considered "source". Example =
Reference:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
- Update: this link-tag is currently also supported by Ask.com, Microsoft Live Search and Yahoo!.

Resources