I keep getting this message:
"Over the last 24 hours, Googlebot encountered 1 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%.
You can see more details about these errors in Webmaster Tools. "
I searched for this and was told to add a robots.txt file to my site.
But when I test the robots.txt in Google Webmaster Tools, it just cannot be fetched.
I thought maybe robots.txt was being blocked by my site, but when I test it, GWT says it is allowed:
'http://momentcamofficial.com/robots.txt'
And here is the content of the robots.txt :
User-agent: *
Disallow:
So why can't the robots.txt be fetched by Google? What did I miss? Can anybody help me?
I had a situation where Googlebot wasn't fetching the file, yet I could see a valid robots.txt in my browser.
The problem turned out to be that I was redirecting my whole site (including robots.txt) to HTTPS, and Google didn't seem to like that. So I excluded robots.txt from the redirect:
RewriteEngine On
# Only redirect requests that arrive over plain HTTP...
RewriteCond %{HTTPS} off
# ...and leave robots.txt alone so Googlebot can always fetch it
RewriteCond %{REQUEST_FILENAME} !robots\.txt
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
More info on my blog
Before Googlebot crawls your site, it accesses your robots.txt file to
determine if your site is blocking Google from crawling any pages or
URLs. If your robots.txt file exists but is unreachable (in other
words, if it doesn’t return a 200 or 404 HTTP status code), we’ll
postpone our crawl rather than risk crawling URLs that you do not want
crawled. When this happens, Googlebot will return to your site and
crawl it as soon as we can successfully access your robots.txt file.
As you know, having a robots.txt is optional, so you don't need to create one; just make sure your host returns only a 200 or 404 HTTP status for it.
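If a CMS or blanket rewrite is catching the request, robots.txt can end up answering with something other than a plain 200 or 404. A hedged way to keep it out of any rewriting (assuming Apache with mod_rewrite) is an early pass-through rule:
RewriteEngine On
# Stop rewrite processing for robots.txt so Apache serves the file
# itself (200) or returns a normal 404 if it does not exist
RewriteRule ^robots\.txt$ - [L]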
You have the wrong content in your robots.txt file. Change it to:
User-agent: *
Allow: /
And make sure that everybody has permission to read the file.
I was getting this error when "yandex" crawled the site, and also with some website checkers. After checking everything multiple times, I made a copy of robots.txt and called it robot.txt. Now "yandex" and the tool both work.
At LTCperformance.com, I've created a custom 404 page. If the user types in ltcperformance.com/fakepage.html, it forwards to the 404 page. But if there's no extension (ltcperformance.com/fakepage), it simply shows a default system 404 page.
I'm controlling the 404 page using .htaccess:
ErrorDocument 404 http://ltcperformance.com/404.php
ErrorDocument 403 http://ltcperformance.com/404.php
ErrorDocument 500 http://ltcperformance.com/404.php
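As an aside, when ErrorDocument is given a full http:// URL, Apache sends a 302 redirect to that page instead of serving it with the original error status, which can hide the real 404 from crawlers. A hedged alternative, assuming 404.php sits in the document root:
# Local paths keep the original 404/403/500 status code;
# a full URL makes Apache answer with a 302 redirect instead
ErrorDocument 404 /404.php
ErrorDocument 403 /404.php
ErrorDocument 500 /404.php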
In Joomla Administrator, I have URL Rewriting = On and Adds Suffix to URL = Off.
Any ideas? I've gone through every answer I can find on other posts and nothing will bring up my custom 404 page if there isn't an extension on the file.
ADDITIONAL INFO:
Any non-existent pages go to the homepage when I do these settings:
- Search Engine Friendly URLs / NO
- Use URL rewriting /Yes
- Adds Suffix to URL /No
I have someone taking a look at it on the server side, but the support can't pinpoint what the actual server issue is, even though everyone online says it's a server issue. It's GoDaddy; I did set their 404 page setting (they have a separate place for it) to my 404 page, but that didn't work either.
Joomla's .htaccess routes all requests to index.php in order to support SEF URLs.
When it can't route a page, it loads the template's error.php page. You can edit that to your requirements; this is the easiest fix.
Should error.php not be included in your template, copy the one in /templates/system to your template folder and customize it.
I have a site submitted to Google Webmaster Tools that I helped a family member redevelop. The site is working well, but Google says four old URLs show crawl errors and return 404s.
From what I've read, this is common with CMS systems when page aliases get changed. However, while I have access to the PHP and .htaccess files, and I can see that mod_rewrite is on, I don't know how to implement a 301 redirect. I'm not familiar with PHP code.
I can't even find the four URLs Google flags as errors, so I don't know where they are being linked from. For example, this is one of them:
mywebsite.co.uk/index.php?page=kitchens (which produces a 404 error page)
should link, or redirect to,
mywebsite.co.uk/kitchens-northampton/
I noticed this code in the .htaccess file, if it's of any relevance:
RewriteRule ^(.+)$ index.php?page=$1
I just don't know what it means. Can anyone offer some advice?
Thank you.
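For what it's worth, that rule maps any request path to index.php?page=<path>, which is how the CMS serves its pages. A minimal sketch of the 301 the question asks for (assuming Apache mod_rewrite; the old and new URLs are taken from the question) would go above the existing catch-all rule:
RewriteEngine On
# RewriteRule cannot see the query string, so match it in a condition
RewriteCond %{QUERY_STRING} ^page=kitchens$
# Redirect permanently; the trailing ? drops the old query string
RewriteRule ^index\.php$ /kitchens-northampton/? [R=301,L]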
The server where I developed a WordPress site was indexed by Google. The site is now live on the actual domain, but Google searches find links to the site at the development server address. The site is on the same server where it was developed; making it live was simply a matter of pointing the domain at the new site. I need to redirect these links, but am not having any luck.
Also, the development server address has a tilde, which Google indexed as %7E. I have tried various versions of the following, all to no avail:
RewriteCond %{HTTP_HOST} ^cardgym\.dcaccess\.net
RewriteRule ^cardgym.dcaccess.net/~chrs/$ http://chrs.org/$1 [R=301,nc]
RewriteRule ^/%7Echrs/(.*)$ http://chrs.org/$1 [R=301,nc]
Going to the development server results in a 404 error in WordPress: http://cardgym.dcaccess.net/~chrs/
Thanks
Can you change your internal web server configuration so that the development domain is an alias of the live site? That would be the easiest solution, imo.
Otherwise, check out the answer by Sigg3.net here: RewriteRule for tilde
If I understand you correctly, your site is live and you moved it to the new domain.
So it appears you already have the live site up and running at http://chrs.org, and there is nothing you need to do to redirect it as far as Google indexing goes.
It will take Google time to crawl the new site and index it.
You can help speed up the process by asking Google to index your new site by submitting it here.
https://www.google.com/webmasters/tools/submit-url?pli=1
.htaccess does not control the way Google indexes the site. If it's on the internet, it will be indexed unless you prevent it. There are a few options to help make those dev links disappear.
A. Add a robots.txt to the root of the dev site with the code below in it; that will keep Google and other search engines that respect robots.txt from indexing it.
# Make changes for all web spiders
User-agent: *
Disallow: /
B. Block the whole dev site with .htaccess password protection, which will stop it from being crawled.
OR
C. Take the dev site down.
It appears you've already moved the dev site to the live domain, which is why you are getting a 404. The links in Google will disappear eventually because they no longer exist. The next time Google tries to crawl your dev site and sees it's not there, the links will be removed. The new site will start to show up as Google begins crawling it. There is nothing you can do right now but wait; it can literally take weeks.
If you really are trying to redirect, then you can add an .htaccess file on the cardgym.dcaccess.net site using Redirect:
# Apache decodes %7E to ~ before matching, so this also catches the
# %7E links Google indexed; the rest of the path is carried over
Redirect 301 /~chrs http://chrs.org
Will this robots.txt file only allow Googlebot to index my site's index.php file? Caveat: I have an .htaccess redirect so that people who type in
http://www.example.com/index.php
are redirected to simply
http://www.example.com/
So, this is my robots.txt file content:
User-agent: Googlebot
Allow: /index.php
Disallow: /
User-agent: *
Disallow: /
Thanks in advance!
Not really.
Good bots
Only "good" bots follow the robots.txt instructions (not all robots and spiders bother to read/follow robots.txt). That might not even include all the main search engine's bots, but it definitely mean that some web crawlers will just completely ignore your requests (you should look at using .htaccess or password protection if you really want to stop bots/crawlers from seeing parts of your site).
Second checks
Google makes multiple visits to your website, including some that appear as an ordinary browsing user. This second visit ignores the robots.txt file. It probably doesn't actually index anything (if that's your worry), but it does check that you're not trying to fool the indexing bot (for SEO etc.).
That being said, your syntax is right. If that's all you're asking, then yes, it'll work; just not as well as you might hope.
Absent the redirect, Googlebot would not see your site except for index.php.
With the redirect, it depends on how the bot handles redirects and how your .htaccess does the redirect. If you return a 302, Googlebot will see http://www.example.com/, check it against robots.txt, and not see the main site. Even if you do an internal redirect and tell Googlebot that the responding page is http://www.example.com/, it will see the page but might not index it.
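For reference, a loop-safe sketch of the index.php-to-root redirect described in the question (assuming Apache mod_rewrite; the THE_REQUEST guard keeps the rule from matching the internal DirectoryIndex subrequest):
RewriteEngine On
# Only fire when the client literally requested /index.php,
# not when mod_dir internally maps / to index.php
RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\s/+index\.php [NC]
RewriteRule ^index\.php$ / [R=301,L]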
It's risky. To be sure that Google does index your homepage, do this:
User-agent: *
Allow: /index.php
Disallow: /a
Disallow: /b
...
Disallow: /z
Disallow: /0
...
Disallow: /9
That way your root "/" will not match any Disallow rule.
Also, if you have AdSense, don't forget to add:
User-agent: Mediapartners-Google
Allow: /
I have a website at a.com (for example). I also have a couple of other domain names which I am not using for anything: b.com and c.com. They currently forward to a.com. I have noticed that Google is indexing content from my site using b.com/stuff and c.com/stuff, not just a.com/stuff. What is the proper way to tell Google to only index content via a.com, not b.com and c.com?
It seems as if a 301 redirect via .htaccess is the best solution, but I am not sure how to do that. There is only one .htaccess file (each domain does not have its own).
b.com and c.com are not meant to be aliases of a.com, they are just other domain names I am reserving for possible future projects.
robots.txt is the way to tell spiders what to crawl and what not to crawl. If you put the following in the root of your site at /robots.txt:
User-agent: *
Disallow: /
A well-behaved spider will not crawl any part of your site. Most large sites have a robots.txt, like Google's:
User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /news
#and so on ...
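One wrinkle here: the domains appear to share a single docroot, so a blocking robots.txt would also apply to a.com. A hedged workaround (assuming Apache mod_rewrite; robots-blocked.txt is a hypothetical second file containing the User-agent: * / Disallow: / rules shown above):
RewriteEngine On
# On the spare domains only, answer robots.txt requests with the blocking file
RewriteCond %{HTTP_HOST} (^|\.)(b|c)\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-blocked.txt [L]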
You can simply create a redirect with a .htaccess file like this:
RewriteEngine on
# Match b.com and c.com, with or without a subdomain such as www
RewriteCond %{HTTP_HOST} (^|\.)b\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} (^|\.)c\.com$ [NC]
RewriteRule ^(.*)$ http://a.com/$1 [R=301,L]
It pretty much depends on what you want to achieve. A 301 says that the content has moved permanently (and it is the proper way of transferring PageRank); is this what you want?
You want Google to behave? Then you may use robots.txt, but keep in mind there is a downside: this file is readable from outside and always located in the same place, so you basically give away the location of directories and files that you may want to protect. Use robots.txt only if there is nothing worth protecting.
If there is something worth protecting, then you should password-protect the directory; this is the proper way. Google will not index password-protected directories.
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=93708
For the last method, it depends on whether you want to use the httpd.conf file or .htaccess. The best way is to use httpd.conf, even if .htaccess seems easier.
http://httpd.apache.org/docs/2.0/howto/auth.html
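For reference, a minimal sketch of HTTP basic auth in httpd.conf (the directory path, realm name, and password-file location are assumptions; create the user file with the htpasswd utility first):
<Directory "/var/www/site/protected">
    AuthType Basic
    AuthName "Restricted area"
    AuthUserFile /etc/apache2/.htpasswd
    Require valid-user
</Directory>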
Have your server-side code generate a canonical reference that points to the page to be considered the "source". For example, in the head of every page served on b.com and c.com:
<link rel="canonical" href="http://a.com/stuff" />
Reference:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
Update: this link tag is now also supported by Ask.com, Microsoft Live Search, and Yahoo!.