These days robots.txt has become an important SEO tool for websites. Through this file, web developers tell crawler robots which paths to check and which to skip. On the other hand, many websites contain secret and important directories and files whose paths must not be mentioned anywhere, to reduce security risk. Mentioning them is like handing a thief a map to all the doors.
The problem is that robots.txt is plain text and readable by everybody, since it is almost always stored in the root directory with full read permission. So if I have a file like this:
User-Agent: *
Disallow:
Disallow: /admin/
I am telling everybody (especially hackers): "I have a directory named admin and it must not be crawled", even though I don't want others to know that such a directory exists on my website.
How can we solve this problem?
You can specify the beginning of the URL path only.
In case of /admin/, you could for example specify:
Disallow: /adm
You just have to find the string that only blocks the URLs you want to block, and not others (like /administer-better).
Depending on your URL structure, it might make sense to add a path segment to all "secret" URLs, and only refer to this segment in your robots.txt, and not the following segments:
Disallow: /private/
# nothing to see when visiting /private/
# the secret URLs are:
# /private/admin/
# /private/login/
You can use the X-Robots-Tag HTTP header on the pages you don't want crawled or indexed.
But I really prefer an IP whitelist when one is available.
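As a sketch, with Apache's mod_headers you could attach this header to individual files without ever naming them in robots.txt (the filename below is a hypothetical placeholder):

```apache
# Send X-Robots-Tag only for one specific file, without listing it in robots.txt
<Files "secret-report.pdf">
    Header set X-Robots-Tag "noindex, nofollow"
</Files>
```

Unlike a robots.txt entry, this reveals nothing to someone browsing your site's public files.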
I'm looking for advice on the right method to do so.
I have a folder on my domain where I am testing a certain landing page.
If it goes well, I might build a new website and domain around this landing page,
and that's the main reason I don't want it to get crawled: I don't want to be punished by Google for duplicate content. I also don't want unwanted bots to scrape this landing page, as no good can come of it. Does that make sense to you?
If so, how can I do this? I don't think robots.txt is the best method, as I understand that not all crawlers respect it, and even Google may not fully respect it. I can't put a password on it, since the landing page should be open to all humans (so the solution must not cause any problems for human visitors). Does that leave the .htaccess file? If so, what code should I add there? Are there any downsides I haven't considered?
Thanks!
Use a robots.txt file with the following content:
User-agent: *
Disallow: /some-folder/
In my opinion this is not wise.
e.g. check this:
http://edition.cnn.com/robots.txt
http://www.bbc.co.uk/robots.txt
http://www.guardian.co.uk/robots.txt
And according to this:
http://www.joomla.org/robots.txt
Joomla.org has not changed the default administration folder :D
E.g. the PrestaShop page has a blank robots.txt file, which is not perfect, but at least better in my opinion:
http://www.prestashop.com/robots.txt
Are these people stupid, or do they think it is OK for everyone to know what their web structure looks like?
Why are they not using .htaccess to deny access to robots instead?
The problem is that .htaccess can't intuitively tell that a visitor is a search engine bot.
Most bots will identify themselves in the user-agent string, but some won't.
Robots.txt is accessed by all the bots looking to index the site, but unscrupulous bots are not going to identify themselves as bots, or pay any attention to robots.txt (or they will deliberately disobey it).
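For the bots that do identify themselves, .htaccess can at least filter on the user-agent string. A sketch using mod_rewrite ("BadBot" is a hypothetical user-agent substring, not a real crawler name):

```apache
RewriteEngine On
# Return 403 Forbidden when the User-Agent contains "BadBot" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]
```

This only helps against honest-but-unwanted bots; anything that spoofs a browser user-agent will sail straight through.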
I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (in theory) to try to prevent unauthorized access, by totally removing access for all search engine bots/crawlers. Having Google index our site to make it searchable is pointless from the business perspective and just adds another way for a hacker to find the website in the first place to try to hack it.
I know in the robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt or is it better done by .htaccess or other?
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.
If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
Header set X-Robots-Tag noindex,nofollow
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
<meta name="robots" content="noindex,nofollow" />
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.
It is best handled with a robots.txt file, at least for the bots that respect the file.
To block the whole site add this to robots.txt in the root directory of your site:
User-agent: *
Disallow: /
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.
Below are .htaccess rules that restrict everyone except visitors from your company IP:
Order allow,deny
# Enter your company's IP address here
Allow from 255.1.1.1
Deny from all
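Note that Order/Allow/Deny is Apache 2.2 syntax. If your server runs Apache 2.4 or later, the equivalent (using the same placeholder address) would be:

```apache
# Apache 2.4+: allow only the company's IP address, deny everyone else
Require ip 255.1.1.1
```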
If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.
That would mean that anyone (google, bot, person-who-stumbled-upon-a-link) who isn't authenticated, wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
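As a sketch, HTTP Basic Authentication can be configured in .htaccess along these lines (the realm name and the path to the password file are placeholders you would adapt):

```apache
# Require a valid user from an htpasswd file (path is a placeholder)
AuthType Basic
AuthName "Members Only"
AuthUserFile /path/to/.htpasswd
Require valid-user
```

The password file itself can be created with Apache's htpasswd utility, e.g. `htpasswd -c /path/to/.htpasswd someuser`.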
In addition to the provided answers, you can stop search engines from crawling/indexing a specific page on your website via robots.txt. Below is an example:
User-agent: *
Disallow: /example-page/
The above example is especially handy when you have dynamic pages; otherwise, you may want to add the HTML meta tag below on the specific pages you want disallowed from search engines:
<meta name="robots" content="noindex, nofollow" />
We want to disallow user-agents from JavaScript files, CSS files and pictures, correct? Classes, modules and other folders of that type should be .htaccess protected. Am I right? If not, please let me know.
As a result, a typical robots.txt (and we don't forget to password-protect the other folders) could contain only a few lines:
User-agent: *
Disallow:
Disallow: /cssfiles/
Disallow: /jsfiles/
Disallow: /pics/
Does it make sense to disallow both mysite.com?index.php&page=registration and mysite.com?index.php&page=login? If yes (what for?), then how?
Also, did I forget something?
Folders that have a basic HTTP authentication requirement applied by an .htaccess file don't have to be in your robots.txt file because spiders will not be able to access them.
I typically do not exclude css/javascript when building sites. I don't think the major search engines are interested in listing those files in their search results because they are not useful to most people. However, if you want to be on the safe side then there is no harm in adding them.
As for images, if you don't want them appearing in places like Google Images then you can add your image folder to robots.txt.
I would not attempt to disallow your registration or login pages. They are legitimate areas of your site and should be indexed.
A very important thing to remember about robots.txt files is that they do not have the ability to enforce their directives. They can only make recommendations to the spider not to crawl certain things. While most major search engines will respect this, some homemade and/or malicious spiders will not. If there's something you want to protect from spiders, make sure it is either protected by some authentication mechanism or not web-accessible.