Can a search engine read robots.txt if its read access is restricted? - .htaccess

I have added a robots.txt file with some lines that restrict certain folders. I also used an .htaccess file to deny everyone access to that robots.txt file. Can search engines still read the content of that file?

This file should be freely readable. Search engines are like visitors on your website. If a visitor can't see this file, then a search engine will not be able to see it either.
There's absolutely no reason to try to hide this file.

Web crawlers need to be able to HTTP GET your robots.txt, or they will be unable to parse the file and respect your configuration.
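What a crawler does after a failed fetch is worth spelling out. The sketch below assumes the behaviour described in RFC 9309 (the Robots Exclusion Protocol); the function name and the returned labels are made up for illustration, and individual crawlers can deviate (Python's own urllib.robotparser, for instance, treats 401 and 403 as "disallow everything").

```python
def robots_policy_for_status(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl policy.

    Rough reading of RFC 9309; the labels are illustrative, not standard terms.
    """
    if 200 <= status < 300:
        return "parse-rules"    # file is readable: obey the rules inside it
    if 400 <= status < 500:
        return "allow-all"      # "unavailable": crawler may act as if no rules exist
    if 500 <= status < 600:
        return "disallow-all"   # "unreachable": crawler must assume everything is disallowed
    return "undefined"          # redirects and other cases need separate handling


# A robots.txt denied through .htaccess typically answers 403.
print(robots_policy_for_status(403))
```

Under that reading, blocking robots.txt with a 403 does not protect the listed folders: a crawler that follows the RFC may simply crawl the site without restrictions.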

The answer is no! But the simplest and safest approach is still to test it:
https://support.google.com/webmasters/answer/6062598?hl=en
The robots.txt Tester tool shows you whether your robots.txt file
blocks Google web crawlers from specific URLs on your site. For
example, you can use this tool to test whether the Googlebot-Image
crawler can crawl the URL of an image you wish to block from Google
Image Search.

Related

Effect of robots.txt

I understand that naming a file to disallow in robots.txt will stop well behaved crawlers from scanning that file's content, but does it (also) stop the file being listed as a search result?
No, neither Google nor Bing will stop indexing the file just because it appears in robots.txt:
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.
https://developers.google.com/search/docs/advanced/robots/intro
It is important to understand that this not by definition implies that a page that is not crawled also will not be indexed. To see how to prevent a page from being indexed see this topic.
https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec
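One way to "block indexing with noindex", as the Google quote puts it, is to send an X-Robots-Tag response header. The sketch below is a minimal WSGI app written only for illustration (the app name and page body are assumptions, not anything from the question); it shows the header and nothing else.

```python
def app(environ, start_response):
    """Minimal WSGI app: serve one page and ask search engines not to index it."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Compliant search engines drop the page from their index when they see this.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [b"<html><body>Private page</body></html>"]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("127.0.0.1", 8000, app) can host it. Note that the page must stay crawlable for this to work: if robots.txt disallows the URL, the crawler never fetches the page and never sees the header.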

seo toolkit - request is disallowed by a robots.txt rule

I am trying to run the SEO toolkit IIS extension on an application I have running but I keep getting the following error:
The request is disallowed by a Robots.txt rule
Now I have edited the robots.txt file in both the application and the root website so they both have the following rules:
User-agent: *
Allow: /
But this makes no difference and the toolkit still won't run.
I have even tried deleting both robots.txt files and that still doesn't make any difference.
Does anyone know any other causes for the seo toolkit to be unable to run or how to solve this problem?
To allow all robots complete access I would recommend using the following syntax (according to robotstxt.org)
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
The allow directive is supported only by "some major crawlers". So perhaps the IIS Search Engine Optimization (SEO) Toolkit's crawler doesn't.
Hope this helps. If it doesn't, you can also try going through IIS SEO Toolkit's Managing Robots.txt and Sitemap Files learning resource.
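If you want to sanity-check the two variants outside of IIS, Python's standard urllib.robotparser can evaluate them. This only shows how that particular parser reads the rules, not how the SEO Toolkit's crawler does; the helper name and the test URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"   # the robotstxt.org form
ALLOW_ROOT = "User-agent: *\nAllow: /\n"        # the form from the question

print(allows(EMPTY_DISALLOW, "iisbot", "/app/page.aspx"))  # True
print(allows(ALLOW_ROOT, "iisbot", "/app/page.aspx"))      # True
```

Both forms allow everything for this parser, so if the toolkit still reports a disallow, the cause is more likely caching or a different robots.txt being served than the syntax itself.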
Check to make sure the DNS record is pointing to the correct server
If you're searching for the file, account for case sensitivity - robots.txt vs Robots.txt
Verify that the Toolkit is actually attempting to visit the site. Check the IIS logs for the presence of the "iisbot" user-agent.
The robots.txt may have been cached. Stop, restart, or unload the IIS application so that robots.txt is refreshed, then open a browser and reload the file. You can even delete the file to be sure that IIS is not caching it.
Basically, robots.txt is a file that tells crawlers which pages the site admin has disallowed. Google does not crawl those pages, so their content does not show up in search results, although the bare URL can still be indexed if other pages link to it.

Security concerns using robots.txt

I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The instructions are to include these in the robots.txt file and place it into the root of my domain.
But I have an issue with this approach: anyone can go to www.mywebsite.com/robots.txt and see something like this:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
That tells everyone exactly which pages I don't want anyone to visit.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want to be exposed to the public: PayPal validation page for my software license payment. The page logic will not let a dud request through, but it wastes bandwidth (for PayPal connection, as well as for validation on my server) plus it logs a connection-attempt entry into the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and in .php scripts on my server. The name of the page itself is something like: /php/ipnius726.php so it's not something simple that a crawler can just guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
1. Secure access to the functionality behind those URLs
2. Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but just put hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
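The honeypot idea can be sketched as a small log scan. This assumes the server writes Common or Combined Log Format lines and that /honeypot.php is the trap path listed in robots.txt; the regex, function name, and sample lines are illustrative only.

```python
import re
from collections import defaultdict

# Hypothetical trap path; it must match the Disallow line in robots.txt.
HONEYPOT_PATH = "/honeypot.php"

# Start of a Common/Combined Log Format line:
# client IP, identity, user, [timestamp], "METHOD target ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)')


def honeypot_clients(log_lines):
    """Count requests per client IP that hit the honeypot path."""
    hits = defaultdict(int)
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in an unexpected format
        ip, target = match.groups()
        path = target.split("?", 1)[0]  # ignore any query string
        if path == HONEYPOT_PATH:
            hits[ip] += 1
    return dict(hits)


sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /honeypot.php HTTP/1.1" 404 153',
    '198.51.100.2 - - [10/Oct/2023:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '203.0.113.7 - - [10/Oct/2023:13:57:12 +0000] "GET /honeypot.php?x=1 HTTP/1.1" 404 153',
]
print(honeypot_clients(sample))
```

The returned IPs are candidates for a blocklist; whether to block by IP or by User-Agent string is a separate decision, since both can be spoofed or shared.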
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
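The prefix matching can be checked with Python's standard urllib.robotparser. This is just a quick verification sketch; the "ExampleBot" agent name and the extra /php/other.php URL are made up for the test.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /php/ipn
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

urls = [
    "http://example.com/php/ipn",
    "http://example.com/php/ipn.html",
    "http://example.com/php/ipn/",
    "http://example.com/php/ipn/foo",
    "http://example.com/php/ipnfoobar",
    "http://example.com/php/ipnius726.php",
    "http://example.com/php/other.php",  # does not share the prefix
]

# Collect every URL the rule blocks for a generic (hypothetical) crawler name.
blocked = [url for url in urls if not parser.can_fetch("ExampleBot", url)]

for url in urls:
    verdict = "blocked" if url in blocked else "allowed"
    print(f"{verdict:8} {url}")
```

Only /php/other.php comes out as allowed; the other six share the /php/ipn prefix and are blocked, without the full file name ever appearing in robots.txt.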
This is to supplement David Underwood's and unor's answers (not having enough rep points I am left with just answering the question). Recent digging is showing that Google has a clause that allows them to ignore the previously respected robots file on top of other security concerns. The link is a blog from Zac Gery explaining the new(er) policy and some simple explanations of how to "force" Google search engine to be nice. I realize this isn't precisely what you are looking for but on the QA and security side, I have found it to be very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html

single-page application with clean URLs without .htaccess file?

My question pertains specifically to the two pages below, but is also more generally relating to methods for using clean URLs without an .htaccess file.
http://www.decitectural.com/
and
http://www.decitectural.com/about/
The pages above are hosted on Amazon's S3, which does not allow for the use of htaccess files. As a result, I have found no easy way to create a clean url rewrite scheme that sends all requests to an index file which, in turn, interprets the URL using javascript and loads up the correct page (with AJAX, or, as is the case with decitectural, with simple div visibility toggling).
In order to circumvent this problem, I usually edit the Amazon S3 bucket properties and set both the index page and the error page to the index.html file. In this case, the index.html file is served even when an invalid path (such as /about/) is requested. This has, for the most part, been a functioning solution... That is, until I realized that the index.html page was being served with a 404 status, which would stop Google from indexing it.
This has led me to seek out an alternative solution to this problem. Currently, as a temporary fix, I am actually creating the /about/ directory on the server with a duplicate of the index.html file in it. This works, but obviously is not a real solution to the problem.
I would appreciate any advice on how to set up a clean URL routing scheme on S3 or in any instance where an .htaccess file can't be used.
Here are a few solutions: Pretty URLs without mod_rewrite, without .htaccess
Also, I guess you can run a script to create the files dynamically from an array or database so it generates all your URLs:
/index.html
/about/index.html
/contact/index.html
...
Then hook the script to every edit, run it from cron, or run it manually. Not the best in terms of performance but hey, it should work.
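A generator script along those lines could look like the sketch below. It assumes the single-page app lives in one index.html that should be copied under every route; the function name, the route list, and the directory names are illustrative, and uploading the result to S3 is left out.

```python
from pathlib import Path
from typing import Iterable, List


def build_route_copies(index_file: Path, output_root: Path, routes: Iterable[str]) -> List[Path]:
    """Write a copy of index_file to <output_root>/<route>/index.html for each route."""
    html = index_file.read_text(encoding="utf-8")
    written = []
    for route in routes:
        # "/" maps to <output_root>/index.html, "/about/" to <output_root>/about/index.html
        target = output_root / route.strip("/") / "index.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(html, encoding="utf-8")
        written.append(target)
    return written
```

For example, build_route_copies(Path("index.html"), Path("site"), ["/", "/about/", "/contact/"]) produces the three files listed above under site/, each answering with a normal 200 once uploaded.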
I think you are going about it the wrong way. S3 gives you complete control of the page structure of your site. If you want your link to be "/about", just upload a file called "about", and you're done. (Set the headers so that the browser knows it's HTML.)
Yes, it will break if someone links to "/about/" or "/about.html". But pretty much any site will break if you mess with their links in odd ways. You will have to be vigilant when linking to your own site, because you won't have any rewrite rules to clean up for you. But you should have automation doing that.

Webmaster Tools Crawler 403 errors

Google Webmaster Tools is reporting 403 errors for some folders on the website's server, for example:
http://www.philaletheians.co.uk/Study%20notes/
The folder isn't forbidden, so I don't understand why Google's crawler would get 403 errors for it.
How come the Google crawler is trying to browse the actual folders and not just going straight to the files in that folder? Is this something to do with robots.txt?
Make sure there is an actual page or document to serve when someone requests that URL. I've browsed through your site and could not find a link that points to http://www.philaletheians.co.uk/Study%20notes/
Also, it seems all the study notes are inside this "Study%20notes" directory, so the bare directory link will not work anyway. Check the "linked from" information in Google Webmaster Tools to find where this broken link is located and fix it.
Have you set the default document correctly in your web server? In Apache, this is the DirectoryIndex setting (which defaults to index.html). Also, in general it might be better to strip spaces and similar characters from your traversable directory names (the %20 you are seeing between Study and notes is a URL-encoded space character), so as to keep your URLs clean for your visitors and search engine bots.
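To see where the %20 comes from, here is a short sketch with Python's urllib.parse. The url_friendly helper is a made-up example of one possible renaming scheme, not something the site has to use.

```python
from urllib.parse import quote, unquote

folder = "Study notes"

print(quote(folder))              # Study%20notes
print(unquote("Study%20notes"))   # Study notes


def url_friendly(name: str) -> str:
    """Illustrative renaming: lowercase the name and join words with hyphens."""
    return "-".join(name.lower().split())


print(url_friendly(folder))       # study-notes
```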
