How to restrict the site from being indexed - .htaccess

I know this question has been asked many times, but I want to be more specific.
I have a development domain and moved the site there to a subfolder. Let's say from:
http://www.example.com/
To:
http://www.example.com/backup
So I want the subfolder not to be indexed by search engines at all. Can robots.txt go in a subfolder, or does it always have to be at the root? I want the content at the root to stay visible to search engines. For now I've put a robots.txt with the following contents in the subfolder:
User-agent: *
Disallow: /
Maybe I need to replace it and put the following in the root instead:
User-agent: *
Disallow: /backup
The other thing is, I read somewhere that certain robots don't respect the robots.txt file, so would just putting an .htaccess file like this in the /backup folder do the job?
Order deny,allow
Deny from all
Any ideas?

This would prevent that directory from being indexed:
User-agent: *
Disallow: /backup/
Additionally, your robots.txt file must be placed in the root of your domain, so in this case, the file would be placed where you can access it in your browser by going to http://example.com/robots.txt
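If you want to sanity-check the rule before deploying it, here is a minimal sketch using Python's standard urllib.robotparser. It only shows how one well-behaved parser reads the rule; the user-agent name "ExampleBot" and the page names are made up for illustration, and real crawlers may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above, as they would appear in /robots.txt at the root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /backup/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """Return True if a compliant crawler may fetch the URL under these rules."""
    # "ExampleBot" is a hypothetical crawler name; it falls under "User-agent: *".
    return parser.can_fetch("ExampleBot", url)


print(allowed("http://www.example.com/"))                    # True
print(allowed("http://www.example.com/backup/index.html"))   # False
```

Note that this only tells you what polite crawlers will do. It says nothing about robots that ignore robots.txt.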
As an aside, you may want to consider setting up a subdomain for your development site, something like http://dev.example.com. Doing so would allow you to completely separate the dev stuff from the production environment and would also ensure that your environments more closely match.
For instance, any absolute paths to JavaScript files, CSS, images or other resources may not work the same from dev to production, and this may cause some issues down the road.
For more information on how to configure this file, see the robotstxt.org site. Good luck!
As a final note, Google Webmaster Tools has a section where you can see what is blocked by the robots.txt file:
To see which URLs Google has been blocked from crawling, visit the Blocked URLs page of the Health section of Webmaster Tools.
I strongly suggest you use this tool, as an incorrectly configured robots.txt file could have a significant impact on the performance of your website.

Related

seo toolkit - request is disallowed by a robots.txt rule

I am trying to run the SEO toolkit IIS extension on an application I have running but I keep getting the following error:
The request is disallowed by a Robots.txt rule
Now I have edited the robots.txt file in both the application and the root website so they both have the following rules:
User-agent: *
Allow: /
But this makes no difference and the toolkit still won't run.
I have even tried deleting both robots.txt files and that still doesn't make any difference.
Does anyone know any other causes for the seo toolkit to be unable to run or how to solve this problem?
To allow all robots complete access, I would recommend the following syntax (according to robotstxt.org):
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
The Allow directive is supported only by "some major crawlers", so perhaps the IIS Search Engine Optimization (SEO) Toolkit's crawler doesn't support it.
Hope this helps. If it doesn't, you can also try going through IIS SEO Toolkit's Managing Robots.txt and Sitemap Files learning resource.
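To see the difference between an empty Disallow and a blanket Disallow: /, here is a small sketch with Python's standard urllib.robotparser. This is only a stand-in parser, not the Toolkit's own crawler, so it cannot tell you how the Toolkit itself reads the file; the URL is a made-up example.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Build a parser from robots.txt text held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# An empty Disallow value means "nothing is disallowed".
allow_all = parse_rules("User-agent: *\nDisallow:\n")
# A single slash disallows every path on the site.
block_all = parse_rules("User-agent: *\nDisallow: /\n")

url = "http://www.example.com/some/page.html"  # hypothetical URL
print(allow_all.can_fetch("iisbot", url))  # True
print(block_all.can_fetch("iisbot", url))  # False
```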
Check to make sure the DNS record is pointing to the correct server
If you're searching for the file, account for case sensitivity - robots.txt vs Robots.txt
Verify that the Toolkit is actually attempting to visit the site. Check the IIS logs for the presence of the "iisbot" user-agent.
The robots.txt may have been cached. Stop/restart/unload IIS (application) so that robots.txt is refreshed. Open a browser and reload the file. You can even delete the file to be sure that IIS is not caching it.
Basically, robots.txt is a file that tells Google not to crawl the pages the site admin has disallowed. Google skips those pages, which is why they generally do not rank and Google does not show their content.

How to solve the robots.txt vulnerability of revealing important and secret paths to hackers?

These days robots.txt has become an important SEO tool for websites. Through this file, web developers tell crawler robots which paths to check and which to skip. On the other hand, websites contain many secret and important directories and files whose paths should not be mentioned anywhere, to anyone, in order to reduce security risks. Speaking about them is like giving a thief a map of all the doors.
The problem is that robots.txt is plain text and easy for everybody to read, because it is almost always stored in the root directory with full read permission. So if I have a file like this
User-Agent: *
Disallow:
Disallow: /admin/
I am saying to everybody (especially hackers): "I have a directory named admin and it must not be crawled", whereas I would rather nobody knew that such a directory exists on my website.
How can we solve this problem?
You can specify the beginning of the URL path only.
In case of /admin/, you could for example specify:
Disallow: /adm
You just have to find the string that only blocks the URLs you want to block, and not others (like /administer-better).
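A quick way to try out candidate strings is a plain prefix check. This is only a sketch: the list of site paths is hypothetical, and it models the basic "path starts with" matching of the original robots.txt convention, not the wildcard extensions some crawlers add.

```python
def blocked_by(prefix, paths):
    """Return the paths that a 'Disallow: <prefix>' rule would match."""
    return [p for p in paths if p.startswith(prefix)]


# Hypothetical paths on the site, for illustration only.
site_paths = ["/admin/", "/admin/users", "/administer-better", "/about"]

# "/adm" hides the directory name but also catches a page you wanted crawled.
print(blocked_by("/adm", site_paths))
# ['/admin/', '/admin/users', '/administer-better']

# "/admin/" is precise but gives the name away.
print(blocked_by("/admin/", site_paths))
# ['/admin/', '/admin/users']
```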
Depending on your URL structure, it might make sense to add a path segment to all "secret" URLs, and only refer to this segment in your robots.txt, and not the following segments:
Disallow: /private/
# nothing to see when visiting /private/
# the secret URLs are:
# /private/admin/
# /private/login/
You can send the X-Robots-Tag HTTP response header with the pages you don't want indexed.
But I really prefer an IP whitelist when one is available.
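Both ideas can be sketched together in a tiny WSGI app. Treat this as an illustration under assumptions, not a drop-in solution: the /admin/ prefix and the allowlisted address are made up, and in practice you would more likely configure this in the web server itself.

```python
# Hypothetical settings for illustration.
ALLOWED_IPS = {"127.0.0.1"}
PRIVATE_PREFIX = "/admin/"


def app(environ, start_response):
    """Minimal WSGI app: private paths get X-Robots-Tag and an IP check."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path.startswith(PRIVATE_PREFIX):
        # Ask compliant crawlers not to index this response.
        headers.append(("X-Robots-Tag", "noindex, nofollow"))
        # Refuse everyone who is not on the allowlist.
        if environ.get("REMOTE_ADDR") not in ALLOWED_IPS:
            start_response("403 Forbidden", headers)
            return [b"forbidden\n"]
    start_response("200 OK", headers)
    return [b"ok\n"]
```

The header only affects crawlers that honor it, while the IP check is actually enforced by the server, which is why the two complement each other.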

A description for this result is not available because of this site's robots.txt – learn more For mobile version

I created a website, www.example.com, and a mobile version of it on the subdomain www.m.example.com. I used an .htaccess file to redirect smartphones to the mobile version. I put my mobile website's files in a folder named "mobile". I put a robots.txt file in the main root folder to prevent mobile URLs from being indexed in search engine results.
My robots.txt file looks like this:
User-agent: *
Disallow: /mobile/
I also put a robots.txt file in the folder named mobile:
User-agent: *
Disallow: /
My problem is this:
In the desktop version, all results and snippets are correct.
But when I search on mobile, the snippet in the result looks like this:
A description for this result is not available because of this site's robots.txt – learn more
How to solve this?
By using this robots.txt on www.m.example.com
User-agent: *
Disallow: /
you are forbidding bots to crawl any resource on www.m.example.com.
If bots are not allowed to crawl, they can’t access your meta-description.
So everything is working as intended.
If you want your pages to get crawled (and indexed), you have to allow it in your robots.txt (or remove it altogether).
By using the canonical link type, you can denote that two (or more) pages are the same, or that they only have trivial differences (e.g., different HTML structure, table sorted differently etc.), or that one is the superset of the other.
By using the alternate link type, you can denote that it’s an alternate representation of essentially the same content.
(You can see examples in my answer on Webmasters SE.)

Security concerns using robots.txt

I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The instructions are to include these in the robots.txt file and place it into the root of my domain.
But I have an issue with this approach: anyone can go to www.mywebsite.com/robots.txt and see something like this:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
That tells anyone exactly which pages I don't want them to visit.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want to be exposed to the public: PayPal validation page for my software license payment. The page logic will not let a dud request through, but it wastes bandwidth (for PayPal connection, as well as for validation on my server) plus it logs a connection-attempt entry into the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and in .php scripts on my server. The name of the page itself is something like /php/ipnius726.php, so it's not something simple that a crawler can just guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
1. Secure access to the functionality behind those URLs
2. Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
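For the first option, a signed token is one simple way to do it. The following is a minimal sketch with Python's standard hmac module, not a complete session system: the key handling and the "user:signature" token format are assumptions made for illustration, and it leaves out expiry entirely.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; a real deployment would persist this safely.
SECRET_KEY = secrets.token_bytes(32)


def _sign(user_id):
    """Return the hex HMAC-SHA256 signature for a user id."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()


def issue_token(user_id):
    """Create a token of the form '<user_id>:<signature>'."""
    return f"{user_id}:{_sign(user_id)}"


def is_valid(token):
    """Check that the signature part matches the user id part."""
    user_id, sep, signature = token.rpartition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(signature.encode(), _sign(user_id).encode())


token = issue_token("alice")
print(is_valid(token))             # True
print(is_valid("alice:deadbeef"))  # False
```

The page behind the URL would then reject any request that does not carry a valid token, so it no longer matters who knows the URL.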
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but just put hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
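Spotting the honeypot hits can be as simple as scanning the access log. Here is a rough sketch that assumes the common "combined" log format; the sample log lines, addresses and bot name are invented for illustration, and your server's format may differ.

```python
import re

HONEYPOT_PATH = "/honeypot.php"  # the decoy listed in robots.txt

# Assumes the "combined" access log format: ip, two fields, [time],
# "request", status, size, "referer", "user-agent".
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+) [^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def honeypot_clients(lines):
    """Return the set of (ip, user_agent) pairs that requested the decoy."""
    hits = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in an unexpected format
        ip, path, agent = match.groups()
        if path.split("?", 1)[0] == HONEYPOT_PATH:
            hits.add((ip, agent))
    return hits


# Invented sample lines for illustration.
sample_log = [
    '203.0.113.7 - - [10/Jan/2024:10:00:00 +0000] "GET /honeypot.php HTTP/1.1" 404 153 "-" "BadBot/1.0"',
    '198.51.100.4 - - [10/Jan/2024:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
print(honeypot_clients(sample_log))  # {('203.0.113.7', 'BadBot/1.0')}
```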
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
This is to supplement David Underwood's and unor's answers (not having enough rep points I am left with just answering the question). Recent digging is showing that Google has a clause that allows them to ignore the previously respected robots file on top of other security concerns. The link is a blog from Zac Gery explaining the new(er) policy and some simple explanations of how to "force" Google search engine to be nice. I realize this isn't precisely what you are looking for but on the QA and security side, I have found it to be very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html

Should we put folders that have htaccess password protection into robots.txt?

We want to disallow user-agents from JavaScript files, CSS files and pictures, correct? Classes, modules and other folders of that type should be .htaccess protected. Am I right? If not, please let me know.
As a result, a typical robots.txt (not forgetting to password protect the other folders) could contain only a few lines:
User-agent: *
Disallow:
Disallow: /cssfiles/
Disallow: /jsfiles/
Disallow: /pics/
Does it make sense to disallow both mysite.com?index.php&page=registration and mysite.com?index.php&page=login? If yes (what for?), then how?
Also, did I forget something?
Folders that have a basic HTTP authentication requirement applied by an .htaccess file don't have to be in your robots.txt file because spiders will not be able to access them.
I typically do not exclude css/javascript when building sites. I don't think the major search engines are interested in listing those files in their search results because they are not useful to most people. However, if you want to be on the safe side then there is no harm in adding them.
As for images, if you don't want them appearing in places like Google Images then you can add your image folder to robots.txt.
I would not attempt to disallow your registration or login pages. They are legitimate areas of your site and should be indexed.
A very important thing to remember about robots.txt files is that they do not have the ability to enforce their directives. They can only make recommendations to the spider not to crawl certain things. While most major search engines will respect this, some homemade and/or malicious spiders will not. If there's something you want to protect from spiders, make sure it is either protected by some authentication mechanism or not web-accessible.
