How can I disallow indexing of the following pages in robots.txt?
http://example.net/something,category1.php
http://example.net/something,category2.php
(...)
http://example.net/something,category152.php
I have tried
Disallow: /something,*.php
but it says I can't use a wildcard (*) there.
You can change the link format so that all the pages you want to disallow sit under one URL directory, or write a program/script that generates a Disallow line in robots.txt for every URL you need to block.
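For example, the generated robots.txt could simply list every category page explicitly; a sketch of what that file could look like:
User-agent: *
Disallow: /something,category1.php
Disallow: /something,category2.php
# ... one line per category, continuing through:
Disallow: /something,category152.php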
I would like Google to ignore URLs like this:
http://www.mydomain.example/new-printers?dir=asc&order=price&p=3
In other words, all the URLs that contain the parameters dir, order, and p should be ignored. How do I do that with robots.txt?
Here's a solution if you want to disallow query strings entirely:
Disallow: /*?*
or, if you want to be more specific about the query string:
Disallow: /*?dir=*&order=*&p=*
You can also add an Allow rule to the robots.txt for a URL that should stay crawlable:
Allow: /new-printer$
The $ makes sure that only the exact URL /new-printer is allowed.
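Putting it together, a sketch of a robots.txt that combines both rules (Allow and the $ anchor are extensions understood by Googlebot and other major crawlers, not part of the original standard):
User-agent: *
Disallow: /*?dir=*&order=*&p=*
Allow: /new-printer$
Googlebot resolves conflicts between Allow and Disallow by the most specific (longest) matching rule, so the order of the lines doesn't matter to it.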
More info:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/
You can block those specific query-string parameters with the following lines:
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=
So if any URL contains dir=, order=, or p= anywhere in the query string, it will be blocked.
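As a sketch, here is how those rules apply to URLs like the ones in the question (the example paths are only illustrations):
# Blocked:     /new-printers?dir=asc&order=price&p=3   (matches all three rules)
# Blocked:     /new-printers?p=2                       (matches Disallow: /*?*p=)
# Not blocked: /new-printers                           (no query string)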
Register your website with Google Webmaster Tools. There you can tell Google how to handle your parameters:
Site Configuration -> URL Parameters
You can also have the pages that contain those parameters mark themselves as excluded from indexing via the robots meta tag.
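For example, a minimal tag placed in the <head> of those pages could look like this:
<meta name="robots" content="noindex">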
I'm stuck on a problem with robots.txt.
I want to disallow http://example.com/forbidden and allow any other subdirectory of http://example.com. Normally the syntax for this would be:
User-agent: *
Disallow: /forbidden/
However, I don't want malicious robots to be able to see that the /forbidden/ directory exists at all - there is nothing linking to it on the site, and I want it to be completely hidden from everybody except those who already know it's there.
Is there a way to accomplish this? My first thought was to place a robots.txt in the subdirectory itself, but that would have no effect. If I don't want my subdirectory to be indexed by either benign or malicious robots, am I safer listing it in robots.txt, or not listing or linking to it at all?
Even if you don’t link to it, crawlers may find the URLs anyhow:
someone else could link to it
some browser toolbars fetch all visited URLs and send them to search engines
your URLs could appear in (public) Referer logs of linked pages
etc.
So you should block them. There are two variants (if you don’t want to use access control):
robots.txt
meta-robots
(both variants only work for polite bots, of course)
You could use robots.txt without using the full folder name:
User-agent: *
Disallow: /fo
This would block all URLs starting with /fo. Of course, you would have to find a prefix that doesn't match other URLs you still want to be indexed.
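For example (the extra paths below are hypothetical, just to illustrate the prefix matching):
User-agent: *
Disallow: /fo
# Blocks /forbidden/, but would also block e.g. /forum/ or /fotos.html
# if such URLs existed on the site.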
However, if a crawler finds a blocked page somehow (see above), it may still add the URL to its index. robots.txt only disallows crawling/indexing of the page content, but using/adding/linking the URL is not forbidden.
With the meta-robots, however, you can even forbid indexing the URL. Add this element to the head of the pages you want to block:
<meta name="robots" content="noindex">
For files other than HTML there is the HTTP header X-Robots-Tag.
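A sketch of how that header could be sent from Apache, assuming mod_headers is enabled (the PDF pattern is only an illustration):
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>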
You're better off not listing it in robots.txt at all. That file is purely advisory; well-behaved robots will abide by the requests it makes, while rude or malicious ones may well use it as a list of potentially interesting targets. If your site contains no links to the /forbidden/ directory, then no robot will find it in any case save one which carries out the equivalent of a dictionary attack, which can be addressed by fail2ban or some similar log trawler; this being the case, including the directory in robots.txt will at best have no additional benefit, and at worst clue in an attacker to the existence of something he might otherwise not have found.
I want to count ad clicks on a widget.
I've used this in my robots.txt file:
User-agent: *
Allow: /
Disallow: */ads_count/*
I've also added nofollow to all links in that widget.
But many bots still follow the URLs in that widget. I log the client IP when counting the clicks, and many of the IPs belong to bots.
Did you try removing the * before /ads_count/?
As Google's SEO documentation says, if you want to block all bots, it's like you did:
User-agent: *    # the * means the rules apply to all bots
Disallow: /ads_count
Note that the path values in directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore whitespace (in particular empty lines) and unknown directives in robots.txt.
Allow and the wildcard * in Disallow are not part of the original robots.txt specification, so not all robots.txt parsers will understand those rules.
If you want to block all pages starting with /ads_count/, you just need:
User-agent: *
Disallow: /ads_count/
However, not all bots respect robots.txt, so you'd still get hits from bad bots that ignore it.
We are currently hosting a large Joomla site.
Google has indexed hundreds of the "print" versions of our pages.
For example, if we have an article with the URL:
www.mysite.com/funnyarticle.html
the Joomla site automatically created:
www.mysite.com/funnyarticle/print.html
We have moved the site and deleted these pages, so they now get a 404 error from Google.
We would like to redirect or rewrite (not sure which is the correct terminology) the "print" URLs to their respective articles.
I would like to use .htaccess to remove:
/print.html
and replace it with:
.html
I have seen examples but cannot get them to work correctly.
So I was hoping I could get specific advice on how to remove and replace the exact code above.
Thanks for your time.
Regards,
Aforantman
You can create a robots.txt file with the following lines:
User-agent: *
Disallow: /*/print.html
This will disallow search engine robots from accessing files named print.html inside any subdirectory.
You probably want to use a RewriteRule. See Apache's guide on how to use them: http://httpd.apache.org/docs/2.0/rewrite/rewrite_guide.html
But if you just want Google (and other search engines) to ignore those print versions, put a corresponding entry in your robots.txt. That way you don't need to fiddle around with Joomla's way of generating and accessing the print version for your human visitors.
You need to put these lines in your DOCROOT/.htaccess file:
RewriteEngine On
RewriteBase /
RewriteRule ^(.*?)/print\.html$ $1.html [L,R=301]
This will redirect any Google user clicking through to one of these pages to the correct article. If your article names can contain / then remove the ? from the above; the rule will still work but might take a few more microseconds of runtime :-)
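For reference, that greedy variant would look like this:
RewriteRule ^(.*)/print\.html$ $1.html [L,R=301]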
You can use robots.txt as Jishnu said. This is the best way to do this:
User-agent: *
Disallow: /*/print.html
In the robots.txt file, I am about to disallow some sections of my site.
For instance, I don't want my "terms and conditions" to be indexed by search engines.
User-agent: *
Disallow: /terms
The real path to the file is actually
/data/terms_and_conditions.html
But I have used .htaccess to rewrite the URL.
Now to my question: should I specify the rewritten URL or the actual URL in the robots.txt?
Follow-up question: do I need to have an "Allow" line too, or will search engines assume that everything not listed in the robots.txt file is allowed?
Thanks
Search engines will assume that everything not listed in robots.txt is allowed, so you don't need an Allow line. In your case, Disallow: /terms will block the rewritten path /terms; robots.txt rules are matched against the URLs that crawlers actually request, so use the rewritten URL rather than the internal file path.
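As a sketch, the file can stay this small; anything not matched by a Disallow rule remains crawlable by default:
User-agent: *
Disallow: /terms
# No Allow line needed; everything else is implicitly allowed.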