I would like Google to ignore URLs like this:
http://www.mydomain.example/new-printers?dir=asc&order=price&p=3
In other words, all URLs that contain the parameters dir, order, and p should be ignored. How do I do that with robots.txt?
Here's a solution if you want to disallow query strings:
Disallow: /*?*
or, if you want to be more precise with your query string:
Disallow: /*?dir=*&order=*&p=*
You can also tell crawlers in robots.txt which URLs to allow:
Allow: /new-printer$
The $ ensures that only /new-printer exactly (with nothing after it) is allowed.
More info:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/
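To see how these patterns behave, here is a minimal Python sketch of Google-style wildcard matching (the function name is my own, and real crawlers additionally apply precedence rules between Allow and Disallow):

```python
import re

def robots_pattern_matches(pattern, path):
    """Check a URL path against a robots.txt Disallow/Allow pattern.

    Sketch of Google-style matching: '*' matches any run of characters,
    a trailing '$' anchors the end of the URL, and plain patterns match
    as prefixes of the path.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # a trailing '$' anchors the match
    return re.match(regex, path) is not None

print(robots_pattern_matches("/*?*", "/new-printers?dir=asc&order=price&p=3"))  # True
print(robots_pattern_matches("/new-printer$", "/new-printer"))   # True: exact match
print(robots_pattern_matches("/new-printer$", "/new-printers"))  # False: '$' anchors the end
```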
You can block those specific query-string parameters with the following lines:
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=
So if any URL contains dir=, order=, or p= anywhere in the query string, it will be blocked.
Register your website with Google Webmaster Tools. There you can tell Google how to deal with your parameters:
Site Configuration -> URL Parameters
You should have the pages that contain those parameters indicate that they should be excluded from indexing via the robots meta tag, e.g.
<meta name="robots" content="noindex">
It seems like some bots are not following my robots.txt file, including MJ12bot, the crawler from majestic.com, which is supposed to follow the instructions.
The file looks like this:
User-agent: google
User-agent: googlebot
Disallow: /results/
Crawl-Delay: 30
User-agent: *
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 30
What I aim to tell the bots is that:
Only Google may crawl any URL containing /travel/, /viajar/ or /reisen/.
None of them should access any URL containing /results/.
The time span between two queries should be at least 30 seconds.
However, MJ12bot is crawling URLs containing /travel/, /viajar/ or /reisen/ anyway, and in addition it does not wait 30 seconds between queries.
mydomain.com/robots.txt is showing the file as expected.
Is there anything wrong with the file?
Your robots.txt is correct: a Disallow value matches from the beginning of the URL path. For example, MJ12bot should not crawl http://example.com/reisen/42/, but it may crawl http://example.com/42/reisen/.
If you checked that the host is the same (https vs. http, www vs. no www, same domain name), you could consider sending Majestic a message:
We are keen to see any reports of potential violations of robots.txt by MJ12bot.
If you don't want to wait, you could test whether it works when targeting MJ12bot directly:
User-agent: MJ12bot
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 20
(I changed the Crawl-Delay to 20 because that’s the maximum value they support. But specifying a higher value should be no problem, they round it down.)
Update
Why might they crawl http://example.com/42/reisen/? That might actually be my problem, since my URLs have the form example.com/de/reisen/ or example.com/en/travel/... Should I change to */travel/ then?
A Disallow value is always the beginning of the URL path.
If you want to disallow crawling of http://example.com/de/reisen/, all of the following lines would achieve it:
Disallow: /
Disallow: /d
Disallow: /de
Disallow: /de/
Disallow: /de/r
etc.
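Python's standard-library urllib.robotparser implements exactly this prefix behavior (it follows the original spec, without wildcards), so you can sanity-check the matching yourself; example.com is just a placeholder:

```python
from urllib import robotparser

# Build a parser from a single standard (wildcard-free) rule.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /de/",
])

# Disallow values are matched against the start of the URL path:
print(rp.can_fetch("MJ12bot", "http://example.com/de/reisen/"))  # False: path starts with /de/
print(rp.can_fetch("MJ12bot", "http://example.com/42/reisen/"))  # True: different prefix
```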
In the original robots.txt specification, * has no special meaning in Disallow values, so Disallow: /*/travel/ would literally block http://example.com/*/travel/.
Some bots support it, though (including Googlebot). The documentation about the MJ12bot says:
Simple pattern matching in Disallow directives compatible with Yahoo's wildcard specification
I don’t know the Yahoo spec they refer to, but it seems likely that they’d support it, too.
But if possible, it would of course be better to rely on the standard features, e.g.:
User-agent: *
Disallow: /en/travel/
Disallow: /de/reisen/
I have the following urls:
http://www.website.com/somethingawesome/?render=xml
http://www.website.com/somethingawesome/?render=json
What I want is to disallow Google from indexing any URL that has ?render=xml or ?render=json in it. This can apply to any URL path.
My thoughts are :
Disallow: /?render=xml
Disallow: /?render=json
Will this work, though? Should I be concerned about the part of the URL before the query string too? How can I make this work?
You will need the wildcard first:
Disallow: /*?render=xml
Disallow: /*?render=json
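To see why the leading * matters: under the original spec a Disallow value is a plain prefix of the URL path, which a simple Python string check makes obvious (the URL is the one from the question):

```python
# Without the wildcard, the rule is a prefix the URL never starts with:
path = "/somethingawesome/?render=xml"
print(path.startswith("/?render=xml"))  # False: the rule would never match

# With Google-style wildcard support, /*?render=xml means
# "any path, then '?render=xml'", which does match this URL.
```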
I want to count ad clicks on a widget.
I've used this in my robots.txt file:
User-agent: *
Allow: /
Disallow: */ads_count/*
I've also added nofollow to all links in that widget.
But many bots still follow the URLs in that widget. I log the client IP for each counted URL, and many of the IPs belong to bots.
Did you try removing the (*) before */ads_count?
As Google's SEO documentation says, if you want to block all bots, do it like this:
User-agent: *
Disallow: /ads_count
(The * in User-agent means the rules apply to all bots.)
Note that path values are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore whitespace (in particular, empty lines) and unknown directives in robots.txt.
Allow and the wildcard * in Disallow are not part of the original robots.txt specification, so not all robots.txt parsers will recognize those rules.
If you want to block all pages starting with /ads_count/, you just need:
User-agent: *
Disallow: /ads_count/
However, not all bots respect robots.txt, so you'd still get hits from bad bots that ignore it.
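A quick way to compare both variants is Python's standard-library robotparser, which follows the original (wildcard-free) spec; the blocked helper and the URLs are my own for illustration:

```python
from urllib import robotparser

def blocked(disallow_value, url):
    # Does a single Disallow line block this URL for an arbitrary bot?
    rp = robotparser.RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: " + disallow_value])
    return not rp.can_fetch("SomeBot", url)

# A spec-following parser treats the value as a plain path prefix,
# so the leading * does not act as a wildcard:
print(blocked("*/ads_count/*", "http://example.com/ads_count/click"))  # False: rule never matches
print(blocked("/ads_count/", "http://example.com/ads_count/click"))    # True: prefix match
```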
In the robots.txt file, I am about to disallow some sections of my site.
For instance, I don't want my "terms and conditions" to be indexed by search engines.
User-agent: *
Disallow: /terms
The real path to the file is actually
/data/terms_and_conditions.html
But I have used .htaccess to rewrite the URL.
Now to my question: should I specify the rewritten URL in robots.txt, or the actual path to the file?
Follow-up question: do I need an Allow line too, or will search engines assume that everything not listed in robots.txt is allowed?
Thanks
Specify the rewritten URL: robots.txt applies to the public URLs that crawlers request, not to your internal file paths. And yes, search engines will assume that anything not disallowed in robots.txt is allowed, so you don't need an Allow line. In your case, Disallow: /terms will block every URL path beginning with /terms.
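A small check with Python's standard-library robotparser illustrates both points; the URLs are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /terms",
])

# Crawlers only ever see the public (rewritten) URL, and /terms blocks it:
print(rp.can_fetch("SomeBot", "http://example.com/terms"))  # False
# The internal file path has a different prefix, so this rule ignores it:
print(rp.can_fetch("SomeBot", "http://example.com/data/terms_and_conditions.html"))  # True
```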
How can I disallow indexing of pages like these in robots.txt?
http://example.net/something,category1.php
http://example.net/something,category2.php
(...)
http://example.net/something,category152.php
I have tried
Disallow: /something,*.php
but it says I can't use a wildcard (*) here.
You can change your link format and pack all the disallowed pages into one URL directory, or write a program/script that generates a robots.txt with a Disallow line for every possible link.
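For the generation route, a short Python sketch (assuming the pages really are numbered category1 through category152, as in the question):

```python
# Emit one Disallow line per category page.
lines = ["User-agent: *"]
lines += ["Disallow: /something,category%d.php" % i for i in range(1, 153)]

robots_txt = "\n".join(lines)
print(robots_txt.splitlines()[1])    # Disallow: /something,category1.php
print(len(robots_txt.splitlines()))  # 153
```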