robots.txt and Mod Rewrite in .htaccess

In the robots.txt file, I am about to disallow some sections of my site.
For instance, I don't want my "terms and conditions" to be indexed by search engines.
User-agent: *
Disallow: /terms
The real path to the file is actually
/data/terms_and_conditions.html
But I have used .htaccess to rewrite the URL.
Now to my question: should I specify the rewritten URL in robots.txt, or the actual URL?
Follow-up question: do I need an "Allow" line too, or will search engines assume that everything not listed in robots.txt is allowed?
Thanks

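Search engines assume that everything not listed in robots.txt is allowed, so you don't need an Allow line. Crawlers only ever see the public (rewritten) URL, never the file path behind it, so the rewritten URL is the one to list. In your case, Disallow: /terms blocks every URL whose path starts with /terms.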
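To check the behavior described above, here is a minimal sketch using Python's standard-library urllib.robotparser as a stand-in for a crawler. The host example.com and the /about path are placeholders, not taken from the question.

```python
from urllib.robotparser import RobotFileParser

# The rules from the question, exactly as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /terms
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    # example.com is a placeholder host; only the path matters for matching.
    return parser.can_fetch("*", "http://example.com" + path)


print(allowed("/terms"))                           # the public, rewritten URL
print(allowed("/data/terms_and_conditions.html"))  # the real file path
print(allowed("/about"))                           # a path no rule mentions
```

Under this parser the rewritten URL is blocked, while the real file path stays crawlable because no rule covers it. If that path is also reachable from the web, it may need its own Disallow line.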

Related

how to Disallow specific urls and parameters in robots.txt? [duplicate]

I would like Google to ignore URLs like this:
http://www.mydomain.example/new-printers?dir=asc&order=price&p=3
In other words, all the URLs that have the parameters dir, order and p should be ignored. How do I do so with robots.txt?

Here's a solution if you want to disallow query strings:
Disallow: /*?*
or if you want to be more precise on your query string:
Disallow: /*?dir=*&order=*&p=*
You can also add the URL you want to allow to robots.txt:
Allow: /new-printers$
The $ makes sure that only /new-printers itself is allowed.
More info:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/

You can block those specific query string parameters with the following lines
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=
So if any URL contains dir=, order=, or p= anywhere in the query string, it will be blocked.
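To compare the patterns suggested in the answers above, here is a simplified sketch of Google-style wildcard matching, where * matches any run of characters and a trailing $ anchors the end of the URL. The function name rule_matches is hypothetical, and this is an approximation for experimenting, not any search engine's actual matcher.

```python
import re


def rule_matches(pattern: str, target: str) -> bool:
    """Return True if a robots.txt path pattern matches a path plus query string.

    Simplified assumption: '*' matches any characters, a trailing '$' anchors
    the end, and every other rule is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, target) is not None


url = "/new-printers?dir=asc&order=price&p=3"

print(rule_matches("/*?*", url))                  # any query string
print(rule_matches("/*?dir=*&order=*&p=*", url))  # parameters in this exact order
print(rule_matches("/*?*dir=", url))              # dir= anywhere in the query
print(rule_matches("/new-printers$", "/new-printers"))
print(rule_matches("/new-printers$", url))
```

Two caveats show up when experimenting with this sketch: the ordered pattern stops matching if the parameters appear in a different order, and /*?*p= also matches any parameter whose name merely ends in p, such as step=.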

Register your website with Google WebMaster Tools. There you can tell Google how to deal with your parameters.
Site Configuration -> URL Parameters
You should also have the pages that contain those parameters indicate that they should be excluded from indexing via the robots meta tag, e.g. <meta name="robots" content="noindex">.

Made changes to robots.txt but search engines still say description not available

Most of the questions I see are trying to hide the site from being indexed by search engines. For myself, I'm attempting the opposite.
For the robots.txt file, I've put the following:
# robots.txt
User-agent: *
Allow: /
# End robots.txt file
To me, this means that search engines are allowed to crawl the whole site. However, when I test it, the search result still says "A description for this result is not available because of this site's robots.txt", even though clicking on the link shows the code above.
I'm guessing it's because it takes a while for Google and Bing to catch up? Or am I doing something wrong?
If it's because they haven't caught up with the changes yet (I made them yesterday afternoon), does anyone have a rough estimate of when the changes will be reflected?

Yes, it takes some time until search engines crawl your pages and your robots.txt again. No serious estimate is possible, as it depends on too many factors. Some search engines offer a service in their webmaster tools to recrawl specific pages, but there is no guarantee that this happens quickly.
Note that your robots.txt is equivalent to:
# robots.txt
User-agent: *
Disallow:
# End robots.txt file
(Many parsers know/understand Allow, but it is not part of the original robots.txt specification.)
And this robots.txt is equivalent to no robots.txt at all (or an empty robots.txt), because Disallow: (= allowing all URLs) is the default.
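The equivalence described above can be checked with a short sketch using Python's standard-library urllib.robotparser. The host example.com and the path /any/page.html are placeholders.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, path: str = "/any/page.html") -> bool:
    # Parse the given robots.txt text and ask whether a generic bot may fetch path.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", "http://example.com" + path)


allow_root = "User-agent: *\nAllow: /\n"       # the asker's file
empty_disallow = "User-agent: *\nDisallow:\n"  # the equivalent form
empty_file = ""                                # an empty robots.txt

for name, text in [("Allow: /", allow_root),
                   ("Disallow:", empty_disallow),
                   ("empty file", empty_file)]:
    print(name, "->", can_fetch(text))
```

All three variants allow the page under this parser, which supports the point that the message in the search results comes from a previously crawled robots.txt rather than from the current file.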

How to stop all search engines, bots to crawl some urls

I want to count ad clicks on a widget.
I've used this in my robots.txt file:
User-agent: *
Allow: /
Disallow: */ads_count/*
I've also added nofollow to all links in that widget.
But many bots still follow the URLs in that widget. I log the client IP to count clicks on those URLs, and many of the IPs come from bots.

Did you try removing the leading * from */ads_count?
As Google's SEO documentation says, if you want to block all bots, it's much like what you did:
User-agent: *    # applies to all bots
Disallow: /ads_count
Note that the path values in directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore white-space (in particular empty lines) and unknown directives in the robots.txt.

Allow and the wildcard * in Disallow are not part of the original robots.txt specification, so not all robots.txt parsers will understand or honor those rules.
If you want to block all pages starting with /ads_count/, you just need:
User-agent: *
Disallow: /ads_count/
However: not all bots respect robots.txt. So you'd still get hits by bad bots that ignore robots.txt.
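To illustrate the parser-compatibility point, here is a minimal sketch that uses Python's standard-library urllib.robotparser as an example of a parser that does plain prefix matching and applies the first matching rule. It stands in for any bot that only knows the simpler rules. The URL /ads_count/click/1 is a made-up example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, path: str) -> bool:
    # example.com is a placeholder host; only the path is matched.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", "http://example.com" + path)


# The asker's rules: a leading wildcard, listed after "Allow: /".
asker_rules = "User-agent: *\nAllow: /\nDisallow: */ads_count/*\n"
# The plain prefix rule suggested in the answer.
simple_rules = "User-agent: *\nDisallow: /ads_count/\n"

print(is_allowed(asker_rules, "/ads_count/click/1"))   # wildcard rule has no effect here
print(is_allowed(simple_rules, "/ads_count/click/1"))  # prefix rule blocks it
print(is_allowed(simple_rules, "/index.html"))         # everything else stays open
```

With this parser the asker's file blocks nothing, while the plain prefix rule works as intended. Bots that ignore robots.txt entirely are unaffected either way.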

htaccess to remove print.html from url

We are currently hosting a large Joomla site.
Google has indexed hundreds of the "print" versions of our pages.
For example, if we have an article with the URL:
www.mysite.com/funnyarticle.html
the Joomla site automatically created:
www.mysite.com/funnyarticle/print.html
We have moved the site and deleted these pages, so Google now gets a 404 error for them.
We would like to redirect or rewrite (not sure which is the correct terminology) the "print" URLs to their respective articles.
I would like to use htaccess to remove:
/print.html
and replace it with:
.html
I have seen examples but cannot get them to work correctly.
So I was hoping I could get specific advice on how to remove and replace the exact code above.
Thanks for your time.
Regards,
Aforantman

You can create a robots.txt file with the following lines.
User-agent: *
Disallow: /*/print.html
This will stop search engine robots that understand wildcards from crawling files named print.html.

You probably want to use a RewriteRule. See Apache's guide on how to use them: http://httpd.apache.org/docs/2.0/rewrite/rewrite_guide.html
But if you just want Google (and other search engines) to ignore those print versions, put a corresponding entry in your robots.txt. That way you don't need to fiddle around with Joomla's way of generating and accessing the print version for your human visitors.

You need to put these lines in your DOCROOT/.htaccess file:
RewriteEngine On
RewriteBase /
RewriteRule ^(.*?)/print\.html$ $1.html [L,R=301]
This will redirect anyone clicking through from Google to one of these pages to the correct article. If your article names can contain / then remove the ? from the above; the rule will still work but might take a few more µs of runtime :-)
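Before deploying a rule like this, the pattern can be tried out offline. The sketch below applies the same regular expression with Python's re module, assuming that Apache strips the leading slash in .htaccess context and that RewriteBase / puts it back on the target. The function name redirect_target and the news/ path are hypothetical.

```python
import re
from typing import Optional

# Same pattern as the RewriteRule, with the dot escaped and the end anchored.
PRINT_URL = re.compile(r"^(.*?)/print\.html$")


def redirect_target(path: str) -> Optional[str]:
    """Return the redirect target for a print URL, or None if the rule does not apply.

    `path` is given without its leading slash, as mod_rewrite sees it in .htaccess.
    """
    match = PRINT_URL.match(path)
    if match is None:
        return None
    # RewriteBase / turns the relative substitution into a root-relative URL.
    return "/" + match.group(1) + ".html"


print(redirect_target("funnyarticle/print.html"))
print(redirect_target("news/funnyarticle/print.html"))
print(redirect_target("funnyarticle.html"))
```

The first two calls map to the article URLs, and the last returns None because an ordinary article URL should pass through unchanged.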

You can use robots.txt, as Jishnu said. This is the best way to do this.
User-agent: *
Disallow: /*/print.html

Disallow dynamic htaccess rewritten url

How can I disallow indexing of these pages in robots.txt?
http://example.net/something,category1.php
http://example.net/something,category2.php
(...)
http://example.net/something,category152.php
I have tried
Disallow: /something,*.php
But it says I can't use a wildcard (*) here.

You can change the link format so that all the pages you want to disallow live under one URL directory, or write a program or script that generates a Disallow line in robots.txt for every URL you want to block.
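The script approach could look like the following minimal sketch. It assumes the elided URLs in the question are numbered consecutively from 1 to 152, which the question does not confirm. The function name build_robots_txt is hypothetical.

```python
def build_robots_txt(first: int = 1, last: int = 152) -> str:
    """Generate one Disallow line per category page, with no wildcards.

    Assumption: the pages are named something,category<N>.php for every N
    between first and last inclusive.
    """
    lines = ["User-agent: *"]
    for n in range(first, last + 1):
        lines.append(f"Disallow: /something,category{n}.php")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt()
print(robots_txt.splitlines()[0])
print(robots_txt.splitlines()[1])
print(robots_txt.splitlines()[-1])
```

Because Disallow rules are prefix matches, a single line such as Disallow: /something,category would probably cover the same pages without a script, at the cost of also blocking any other URL that starts with that prefix.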
