How to stop all search engines and bots from crawling some URLs

I want to count ad clicks on a widget.
I've used this in my robots.txt file:
User-agent: *
Allow: /
Disallow: */ads_count/*
I've also added nofollow to all links in that widget.
But many bots still follow the URLs in that widget. I log the client IP to count clicks on those URLs, and many of the IPs come from bots.

Did you try removing the * before /ads_count?
As the Google documentation for SEO says, if you want to block all bots, you do it the way you did:
User-agent: *   # who the rules apply to; * means all bots
Disallow: /ads_count
Note that the paths in directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore white-space (in particular empty lines) and unknown directives in the robots.txt.

Allow and the wildcard * in Disallow are not part of the original robots.txt specification, so not all robots.txt parsers will understand those rules.
If you want to block all pages starting with /ads_count/, you just need:
User-agent: *
Disallow: /ads_count/
However: not all bots respect robots.txt. So you'd still get hits by bad bots that ignore robots.txt.
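One way to sanity-check the simplified rule before deploying it is Python's standard-library urllib.robotparser, which uses plain prefix matching. This is only a minimal sketch; the example.com URLs and the bot name SomeBot are placeholders, not taken from the question.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /ads_count/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant bot may not fetch anything below /ads_count/ ...
ads_ok = parser.can_fetch("SomeBot", "http://example.com/ads_count/123")
# ... but every other path stays crawlable.
other_ok = parser.can_fetch("SomeBot", "http://example.com/widgets/")

print("ads_count allowed:", ads_ok)
print("other path allowed:", other_ok)
```

This only shows what a well-behaved crawler would do. Hits from bots that ignore robots.txt still have to be filtered out when counting clicks.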

Related

How to disallow specific URLs and parameters in robots.txt?

I would like Google to ignore URLs like this:
http://www.mydomain.example/new-printers?dir=asc&order=price&p=3
In other words, all the URLs that have the parameters dir, order and p should be ignored. How do I do so with robots.txt?
Here's a solution if you want to disallow query strings:
Disallow: /*?*
or if you want to be more precise on your query string:
Disallow: /*?dir=*&order=*&p=*
You can also add to the robots.txt which URL to allow:
Allow: /new-printers$
The $ makes sure that only /new-printers itself is allowed.
More info:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/
You can block those specific query string parameters with the following lines
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=
So if any URL contains dir=, order=, or p= anywhere in the query string, it will be blocked.
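Wildcard rules like these are easy to get subtly wrong, so it can help to try them against sample URLs first. The sketch below is a rough approximation of the extended matching that some crawlers support (* for any run of characters, a trailing $ for end of URL); it is not any search engine's actual implementation, and rule_matches is a hypothetical helper name.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Approximate wildcard matching of a robots.txt rule against a URL path.

    '*' matches any sequence of characters; a trailing '$' anchors the
    rule at the end of the path. Everything else is a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives prefix semantics.
    return re.match(regex, path) is not None


sample = "/new-printers?dir=asc&order=price&p=3"
for rule in ("/*?*dir=", "/*?*order=", "/*?*p="):
    print(rule, rule_matches(rule, sample))

# Caveat: "p=" is matched anywhere after the '?', so a parameter
# such as "step=" is caught by the third rule as well.
print(rule_matches("/*?*p=", "/list?step=2"))
```

If that last match is not wanted, a narrower pair of rules such as /*?p= and /*&p= would be one option, again only for crawlers that support wildcards.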
Register your website with Google WebMaster Tools. There you can tell Google how to deal with your parameters.
Site Configuration -> URL Parameters
You should have the pages that contain those parameters indicate that they should be excluded from indexing via the robots meta tag. e.g.
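The example that followed "e.g." appears to be missing here. A robots meta tag of the usual form is <meta name="robots" content="noindex">, placed in the page's head. To confirm that a parameterised page really serves it, a small check with Python's standard-library html.parser could look like this; RobotsMetaParser and has_noindex are illustrative names, and the HTML string is made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = """<html><head>
<meta name="robots" content="noindex, follow">
<title>Printers, sorted by price</title>
</head><body></body></html>"""

print(has_noindex(page))
```

A crawler has to be able to fetch the page to see the tag, so this approach does not combine with a robots.txt Disallow for the same URLs.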

Bots not following robots.txt file

It seems like some bots are not following my robots.txt file, including MJ12bot which is the one from majestic.com and is supposed to follow the instructions.
The file looks like this:
User-agent: google
User-agent: googlebot
Disallow: /results/
Crawl-Delay: 30
User-agent: *
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 30
What I aim to tell the bots is that:
Only google can crawl any url containing /travel/, /viajar/ or /reisen/.
None of them should access any url containing /results/.
The time span between 2 queries should be at least 30 seconds.
However, MJ12bot is crawling URLs containing /travel/, /viajar/ or /reisen/ anyway, and in addition, it does not wait 30 seconds between queries.
mydomain.com/robots.txt is showing the file as expected.
Is there anything wrong with the file?
Your robots.txt is correct.
For example, the MJ12bot should not crawl http://example.com/reisen/42/, but it may crawl http://example.com/42/reisen/.
If you checked that the host is the same (https vs. http, www vs. no www, same domain name), you could consider sending Majestic a message:
We are keen to see any reports of potential violations of robots.txt by MJ12bot.
If you don’t want to wait, you could check whether it works when you target MJ12bot directly:
User-agent: MJ12bot
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 20
(I changed the Crawl-Delay to 20 because that’s the maximum value they support. But specifying a higher value should be no problem, they round it down.)
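To see which group a given bot would pick up, the file can be fed to Python's urllib.robotparser. This is only a sketch of how one parser reads the rules, not a statement about what MJ12bot itself does; OtherBot and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 20

User-agent: *
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 30
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The path starts with /reisen/, so the rule applies.
prefix_hit = parser.can_fetch("MJ12bot", "http://example.com/reisen/42/")
# /reisen/ appears later in the path, so no rule matches.
inner_hit = parser.can_fetch("MJ12bot", "http://example.com/42/reisen/")

# The named group wins over the * group for MJ12bot.
mj12_delay = parser.crawl_delay("MJ12bot")
other_delay = parser.crawl_delay("OtherBot")

print(prefix_hit, inner_hit, mj12_delay, other_delay)
```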
Update
Why might they crawl http://example.com/42/reisen/? That might be actually my problem, since the url has the form example.com/de/reisen/ or example.com/en/travel/... Should I change to */travel/ then?
A Disallow value is always the beginning of the URL path.
If you want to disallow crawling of http://example.com/de/reisen/, all of the following lines would achieve it:
Disallow: /
Disallow: /d
Disallow: /de
Disallow: /de/
Disallow: /de/r
etc.
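The prefix behaviour can be checked with Python's urllib.robotparser, which implements this kind of matching. A minimal sketch; is_blocked and the bot name AnyBot are made-up names for the illustration.

```python
from urllib.robotparser import RobotFileParser

URL = "http://example.com/de/reisen/"


def is_blocked(disallow_value: str, url: str) -> bool:
    """Build a one-rule robots.txt and report whether it blocks the URL."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return not parser.can_fetch("AnyBot", url)


# Every value that is a prefix of /de/reisen/ blocks the URL;
# /reisen/ is not a prefix of that path, so it does not.
candidates = ["/", "/d", "/de", "/de/", "/de/r", "/reisen/"]
results = {value: is_blocked(value, URL) for value in candidates}

for value, blocked in results.items():
    print(f"Disallow: {value:<10} blocks {URL}: {blocked}")
```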
In the original robots.txt specification, * has no special meaning in Disallow values, so Disallow: /*/travel/ would literally block http://example.com/*/travel/.
Some bots support it, though (including Googlebot). The documentation about the MJ12bot says:
Simple pattern matching in Disallow directives compatible with Yahoo's wildcard specification
I don’t know the Yahoo spec they refer to, but it seems likely that they’d support it, too.
But if possible, it would of course be better to rely on the standard features, e.g.:
User-agent: *
Disallow: /en/travel/
Disallow: /de/reisen/

Made changes to robots.txt but search engines still say description not available

Most of the questions I see are trying to hide the site from being indexed by search engines. For myself, I'm attempting the opposite.
For the robots.txt file, I've put the following:
# robots.txt
User-agent: *
Allow: /
# End robots.txt file
To me, this means that search engines are allowed to crawl the whole site. However, when I test it, the result is still displayed as "A description for this result is not available because of this site's robots.txt", although when I click on the link, it displays the above code.
I'm guessing it's because it takes a while for Google and Bing to catch up? Or am I doing something wrong?
If it's because they haven't caught up with the changes yet (these changes were made yesterday afternoon), does anyone have a rough estimate of when the changes will be reflected?
Yes, it takes some time until search engines crawl your pages, or your robots.txt, again. There can be no serious estimate, as it depends on too many factors. Some search engines offer a service in their webmaster tools to recrawl specific pages, but there is no guarantee that this happens quickly.
Note that your robots.txt is equivalent to:
# robots.txt
User-agent: *
Disallow:
# End robots.txt file
(Many parsers know/understand Allow, but it is not part of the original robots.txt specification.)
And this robots.txt is equivalent to no robots.txt at all (or an empty robots.txt), because Disallow: (= allowing all URLs) is the default.
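The equivalence can be illustrated with Python's urllib.robotparser, one of the parsers that understands Allow. A small sketch; the variant names, AnyBot and the example.com URLs are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

VARIANTS = {
    "allow_all": "User-agent: *\nAllow: /\n",
    "empty_disallow": "User-agent: *\nDisallow:\n",
    "empty_file": "",
}

URLS = [
    "http://example.com/",
    "http://example.com/some/page.html",
]


def can_crawl(robots_txt: str, url: str) -> bool:
    """Check one URL against robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("AnyBot", url)


# All three variants should give the same verdict for every URL.
verdicts = {
    name: [can_crawl(text, url) for url in URLS]
    for name, text in VARIANTS.items()
}

for name, allowed in verdicts.items():
    print(name, allowed)
```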

Blocking a subdomain of a website from being indexed

How can I prevent a subdomain, such as it.domain.com, from showing up in search results?
Is there an .htaccess setting for this?
Put a robots.txt into its root directory and have it say
User-agent: *
Disallow: /
Decent spiders (i.e. all big search engines) will respect it.
I'm not sure, but try this:
User-agent: *
Disallow: /
Note: this is the same code as in the answer already submitted.

robots.txt and Mod Rewrite in .htaccess

In the robots.txt file, I am about to disallow some sections of my site.
For instance, I don't want my "terms and conditions" to be indexed by search engines.
User-agent: *
Disallow: /terms
The real path to the file is actually
/data/terms_and_conditions.html
But I have used .htaccess to rewrite the URL.
Now to my question: should I specify the rewritten URL in the robots.txt or the actual URL?
Follow-up question: Do I need to have an "allow" line too, or will the search engines assume all other is allowed which isn't in the robots.txt file?
Thanks
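Search engines will assume that everything not listed in the robots.txt is allowed, so you don't need an Allow line. Crawlers only see the URLs they request, so specify the rewritten URL. In your case, Disallow: /terms blocks every path that starts with /terms.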
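A quick check with Python's urllib.robotparser shows which of the two paths the rule from the question covers. This is only a sketch: example.com, AnyBot and the /terms-of-sale path are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /terms"])


def blocked(path: str) -> bool:
    """Report whether a compliant bot is barred from the given path."""
    return not parser.can_fetch("AnyBot", "http://example.com" + path)


# The public, rewritten URL is what crawlers request, and it is blocked.
public_blocked = blocked("/terms")
# The internal file path is not covered by the rule. That only matters
# if the file is also reachable under that URL.
internal_blocked = blocked("/data/terms_and_conditions.html")
# Prefix matching also catches other paths that start with /terms.
sibling_blocked = blocked("/terms-of-sale")

print(public_blocked, internal_blocked, sibling_blocked)
```

If the file can still be requested directly under /data/terms_and_conditions.html, that path would need its own Disallow line.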

Resources