I seem to be receiving a lot of 404 errors from Google Webmaster Tools of late. Is there a way to prevent these sorts of links from being indexed? And where do they come from?
I'd appreciate it if you could shed some light on this matter. Thanks in advance.
I am on OpenCart 1.5.4.1.
These are the 404 errors:
http://mydomain.com/product/search&filter_tag=Purse&sort=rating&order=ASC&limit=25
http://mydomain.com/product/search&sort=p.price&order=DESC&filter_tag=&limit=25
http://mydomain.com/cart?qty=1&id_product=323&token=bfcf19e76ed4a88abd8508e8dc097a70&add
I've also included the following entries in my robots.txt, to no avail:
User-agent: Googlebot
User-agent: *
Disallow: /cgi-bin
Disallow: /*?limit
Disallow: /*&limit
Disallow: /*?sort
Disallow: /*&sort
Disallow: /*?route=checkout
Disallow: /*?route=account
Disallow: /*?route=product/search
Disallow: /*?route=affiliate
Disallow: /*&keyword
Disallow: /*?page=1
It looks like you are using an extension that tries to rewrite ugly URLs into better, more friendly ones, or were at one point. If you can, use .htaccess to rewrite the bad URLs to the correct ones rather than just blocking the pages; this will help your site.
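As a rough, untested sketch (assuming the broken links all follow the /product/search&... pattern shown above and that OpenCart's standard route= parameter is in use), an .htaccess rule along these lines could 301-redirect the malformed URLs to the query-string form OpenCart expects. Place it before OpenCart's existing catch-all rewrite rule and adjust the pattern to your actual links:
# Sketch: redirect paths where parameters were appended with "&" instead of "?"
RewriteEngine On
RewriteRule ^product/search&(.*)$ /index.php?route=product/search&$1 [R=301,L]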
How to block all bots except Google and Bing?
I am using Cloudflare, but I am confused about how to do this.
I want all bots except these two to face the Cloudflare JS Challenge.
Simple: create a robots.txt file in the root directory and add the bots you want to allow or disallow:
# robots.txt
User-agent: *
Disallow: /
User-agent: bingbot
Allow: /
Disallow: /some-page-for-bingbot/
User-agent: googlebot
Allow: /
Disallow: /some-page-for-googlebot/
We have an Umbraco website that has several subdomains, and we want to exclude one of them from being crawled by search engines for now.
I tried to change my robots.txt file, but it seems I am not doing it right.
URL: http://mywebsite.co.dl/
Subdomain: http://sub1.mywebsite.co.dl/
My robots.txt content is as follows:
User-agent: *
Disallow: sub1.*
What have I missed?
The following code will block http://sub1.mywebsite.co.dl/ from being indexed:
User-agent: *
Disallow: /sub1/
You can also add another robots.txt file in the sub1 folder with the following code:
User-agent: *
Disallow: /
and that should help as well.
If you want to block anything on http://sub1.mywebsite.co.dl/, your robots.txt MUST be accessible at http://sub1.mywebsite.co.dl/robots.txt.
This robots.txt will block all URLs for all bots that support it:
User-agent: *
Disallow: /
I searched Google for my site, "DailyMuses", and got the following result.
It says that "A description for this result is not available because of this site's robots.txt".
I went to check the contents of robots.txt in my web app, and the contents are as follows:
# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
User-Agent: *
Disallow: /
Can anyone advise me on how I can get around this and allow a description to be shown in Google's search results?
Remove the robots.txt. Done.
UPDATE: To allow search engines to crawl only the index page, use this:
User-agent: *
Allow: /index.php
Allow: /$
Disallow: /
Or replace index.php with your index file name, such as index.html or index.jsp.
Is this the correct way to do this? Below is my robots.txt file. Would this prevent Google from indexing my admin directory as well as oldpage.php?
User-agent: *
Allow: /
Disallow: /admin/
Disallow: http://www.mysite.com/oldpage.php
Yes, you are absolutely correct, except for the single-file rule: Disallow takes a path relative to the site root, not a full URL.
User-agent: * : applies to all crawlers
Allow: / : allows access to the entire site
Disallow: /admin/ : blocks the admin directory
Disallow: /oldpage.php : blocks oldpage.php (path only, not the full URL)
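Putting that together, a corrected version of the file from the question could look like this (the only change is using the relative path for oldpage.php instead of the full URL):
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /oldpage.php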
I need some help. I'm not sure about the order in which mod_rewrite and robots.txt are applied to a request.
Some URLs have a rewrite rule:
/index.php?id=123 to /home
Other URLs don't have a rewrite:
/index.php?id=444
I made this entry in my robots.txt:
User-agent: *
Disallow: /index.php?id
Will the page at /home be indexed by search engines?
The robots.txt file is interpreted by the client (the spider), which knows nothing about the rewrites in your system. Thus, spiders will not fetch URLs from your site that match the pattern in robots.txt, but they will fetch the same content if they find it through /home.
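To make that concrete, here is how your rule applies to the URLs above (the # lines are just illustrative comments): the rewritten /home URL never matches the pattern, because the spider only ever sees the friendly form.
User-agent: *
Disallow: /index.php?id
# Blocked:     /index.php?id=123
# Blocked:     /index.php?id=444
# Not blocked: /home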