I would like Google to ignore URLs like this:
http://www.mydomain.example/new-printers?dir=asc&order=price&p=3
In other words, all URLs that contain the parameters dir, order, and p should be ignored. How do I do that with robots.txt?
Here's a solution if you want to disallow query strings:
Disallow: /*?*
or, if you want to be more precise with your query string:
Disallow: /*?dir=*&order=*&p=*
You can also tell crawlers in robots.txt which URLs to allow:
Allow: /new-printer$
The $ ensures that only /new-printer exactly (with nothing after it) is allowed.
More info:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/
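To see how these patterns behave, here is a minimal Python sketch of Google-style wildcard matching (the function name is my own, and real crawlers additionally apply precedence rules between Allow and Disallow):

```python
import re

def robots_pattern_matches(pattern, path):
    """Check a URL path against a robots.txt Disallow/Allow pattern.

    Sketch of Google-style matching: '*' matches any run of characters,
    a trailing '$' anchors the end of the URL, and plain patterns match
    as prefixes of the path.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # a trailing '$' anchors the match
    return re.match(regex, path) is not None

print(robots_pattern_matches("/*?*", "/new-printers?dir=asc&order=price&p=3"))  # True
print(robots_pattern_matches("/new-printer$", "/new-printer"))   # True: exact match
print(robots_pattern_matches("/new-printer$", "/new-printers"))  # False: '$' anchors the end
```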
You can block those specific query-string parameters with the following lines:
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=
So if any URL contains dir=, order=, or p= anywhere in the query string, it will be blocked.
Register your website with Google Webmaster Tools. There you can tell Google how to deal with your parameters:
Site Configuration -> URL Parameters
You should have the pages that contain those parameters indicate that they should be excluded from indexing via the robots meta tag, e.g.
<meta name="robots" content="noindex">
It seems like some bots are not following my robots.txt file, including MJ12bot, the crawler from majestic.com, which is supposed to follow the instructions.
The file looks like this:
User-agent: google
User-agent: googlebot
Disallow: /results/
Crawl-Delay: 30
User-agent: *
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 30
What I aim to tell the bots is that:
Only Google may crawl any URL containing /travel/, /viajar/ or /reisen/.
None of them should access any URL containing /results/.
The time span between two queries should be at least 30 seconds.
However, MJ12bot is crawling URLs containing /travel/, /viajar/ or /reisen/ anyway, and in addition it does not wait 30 seconds between queries.
mydomain.com/robots.txt is showing the file as expected.
Is there anything wrong with the file?
Your robots.txt is correct: a Disallow value matches from the beginning of the URL path. For example, MJ12bot should not crawl http://example.com/reisen/42/, but it may crawl http://example.com/42/reisen/.
If you checked that the host is the same (https vs. http, www vs. no www, same domain name), you could consider sending Majestic a message:
We are keen to see any reports of potential violations of robots.txt by MJ12bot.
If you don't want to wait, you could test whether it works when targeting MJ12bot directly:
User-agent: MJ12bot
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 20
(I changed the Crawl-Delay to 20 because that’s the maximum value they support. But specifying a higher value should be no problem, they round it down.)
Update
Why might they crawl http://example.com/42/reisen/? That might actually be my problem, since my URLs have the form example.com/de/reisen/ or example.com/en/travel/... Should I change to */travel/ then?
A Disallow value is always the beginning of the URL path.
If you want to disallow crawling of http://example.com/de/reisen/, all of the following lines would achieve it:
Disallow: /
Disallow: /d
Disallow: /de
Disallow: /de/
Disallow: /de/r
etc.
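Python's standard-library urllib.robotparser implements exactly this prefix behavior (it follows the original spec, without wildcards), so you can sanity-check the matching yourself; example.com is just a placeholder:

```python
from urllib import robotparser

# Build a parser from a single standard (wildcard-free) rule.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /de/",
])

# Disallow values are matched against the start of the URL path:
print(rp.can_fetch("MJ12bot", "http://example.com/de/reisen/"))  # False: path starts with /de/
print(rp.can_fetch("MJ12bot", "http://example.com/42/reisen/"))  # True: different prefix
```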
In the original robots.txt specification, * has no special meaning in Disallow values, so Disallow: /*/travel/ would literally block http://example.com/*/travel/.
Some bots support it, though (including Googlebot). The documentation about the MJ12bot says:
Simple pattern matching in Disallow directives compatible with Yahoo's wildcard specification
I don’t know the Yahoo spec they refer to, but it seems likely that they’d support it, too.
But if possible, it would of course be better to rely on the standard features, e.g.:
User-agent: *
Disallow: /en/travel/
Disallow: /de/reisen/
I have the following urls:
http://www.website.com/somethingawesome/?render=xml
http://www.website.com/somethingawesome/?render=json
What I want is to disallow Google from indexing any URL that has ?render=xml or ?render=json in it. This can apply to any URL path.
My thoughts are :
Disallow: /?render=xml
Disallow: /?render=json
Will this work, though? Should I be concerned about the part of the URL before the query string too? How can I make this work?
You will need the wildcard first:
Disallow: /*?render=xml
Disallow: /*?render=json
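To see why the leading * matters: under the original spec a Disallow value is a plain prefix of the URL path, which a simple Python string check makes obvious (the URL is the one from the question):

```python
# Without the wildcard, the rule is a prefix the URL never starts with:
path = "/somethingawesome/?render=xml"
print(path.startswith("/?render=xml"))  # False: the rule would never match

# With Google-style wildcard support, /*?render=xml means
# "any path, then '?render=xml'", which does match this URL.
```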
I want to count ad clicks on a widget.
I've used this in my robots.txt file:
User-agent: *
Allow: /
Disallow: */ads_count/*
I've also added nofollow to all links in that widget.
But many bots still follow the URLs in that widget. I log the client IP for each counted URL, and many of the IPs belong to bots.
Did you try removing the (*) before */ads_count?
As Google's SEO documentation says, if you want to block all bots, do it like this:
User-agent: *
Disallow: /ads_count
(The * in User-agent means the rules apply to all bots.)
Note that path values are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore whitespace (in particular, empty lines) and unknown directives in robots.txt.
Allow and the wildcard * in Disallow are not part of the original robots.txt specification, so not all robots.txt parsers will recognize those rules.
If you want to block all pages starting with /ads_count/, you just need:
User-agent: *
Disallow: /ads_count/
However, not all bots respect robots.txt, so you'd still get hits from bad bots that ignore it.
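A quick way to compare both variants is Python's standard-library robotparser, which follows the original (wildcard-free) spec; the blocked helper and the URLs are my own for illustration:

```python
from urllib import robotparser

def blocked(disallow_value, url):
    # Does a single Disallow line block this URL for an arbitrary bot?
    rp = robotparser.RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: " + disallow_value])
    return not rp.can_fetch("SomeBot", url)

# A spec-following parser treats the value as a plain path prefix,
# so the leading * does not act as a wildcard:
print(blocked("*/ads_count/*", "http://example.com/ads_count/click"))  # False: rule never matches
print(blocked("/ads_count/", "http://example.com/ads_count/click"))    # True: prefix match
```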
In the robots.txt file, I am about to disallow some sections of my site.
For instance, I don't want my "terms and conditions" to be indexed by search engines.
User-agent: *
Disallow: /terms
The real path to the file is actually
/data/terms_and_conditions.html
But I have used .htaccess to rewrite the URL.
Now to my question: should I specify the rewritten URL in robots.txt, or the actual path to the file?
Follow-up question: do I need an Allow line too, or will search engines assume that everything not listed in robots.txt is allowed?
Thanks
Specify the rewritten URL: robots.txt applies to the public URLs that crawlers request, not to your internal file paths. And yes, search engines will assume that anything not disallowed in robots.txt is allowed, so you don't need an Allow line. In your case, Disallow: /terms will block every URL path beginning with /terms.
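A small check with Python's standard-library robotparser illustrates both points; the URLs are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /terms",
])

# Crawlers only ever see the public (rewritten) URL, and /terms blocks it:
print(rp.can_fetch("SomeBot", "http://example.com/terms"))  # False
# The internal file path has a different prefix, so this rule ignores it:
print(rp.can_fetch("SomeBot", "http://example.com/data/terms_and_conditions.html"))  # True
```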
How can I disallow indexing of pages like these in robots.txt?
http://example.net/something,category1.php
http://example.net/something,category2.php
(...)
http://example.net/something,category152.php
I have tried
Disallow: /something,*.php
but it says I can't use a wildcard (*) here.
You can change your link format and pack all the disallowed pages into one URL directory, or write a program/script that generates a robots.txt with a Disallow line for every possible link.
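For the generation route, a short Python sketch (assuming the pages really are numbered category1 through category152, as in the question):

```python
# Emit one Disallow line per category page.
lines = ["User-agent: *"]
lines += ["Disallow: /something,category%d.php" % i for i in range(1, 153)]

robots_txt = "\n".join(lines)
print(robots_txt.splitlines()[1])    # Disallow: /something,category1.php
print(len(robots_txt.splitlines()))  # 153
```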