I understand that robots.txt is a file intended for "robots", or should I say "automated crawlers". However, does it prevent a human from typing in the URL of a "forbidden" page and gathering the data by hand?
Maybe it's clearer with an example: I cannot crawl this page:
https://www.drivy.com/search?address=Gare+de+Li%C3%A8ge-Guillemins&address_source=&poi_id=&latitude=50.6251&longitude=5.5659&city_display_name=&start_date=2019-04-06&start_time=06%3A00&end_date=2019-04-07&end_time=06%3A00&country_scope=BE
Can I still retrieve the JSON file containing the data "manually", via my web browser's developer tools?
robots.txt files are guidelines; they do not prevent anyone, human or machine, from accessing any content.
The default settings.py file that is generated for a Scrapy project sets ROBOTSTXT_OBEY to True. You can set it to False if you wish.
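For reference, a minimal sketch of what that looks like in a Scrapy project's settings.py (the setting name is real; the rest of the generated file is omitted here):

# settings.py, as generated by `scrapy startproject`
ROBOTSTXT_OBEY = False  # the default template sets this to True; False makes Scrapy ignore robots.txt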
Mind that websites may nonetheless employ anti-scraping measures to prevent you from scraping those pages. But that is a whole other topic.
Based on the original robots.txt specification from 1994, the rules in a robots.txt file only apply to robots (emphasis mine):
WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.
[…]
These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed.
So, robots are programs that automatically retrieve documents linked/referenced in other documents.
If a human retrieves a document (using a browser or some other program), or if a human feeds a list of manually collected URLs to some program (and the program doesn’t add/follow references in the retrieved documents), the rules in the robots.txt do not apply.
The FAQ "What is a WWW robot?" confirms this:
Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
Related
I understand that naming a file to disallow in robots.txt will stop well behaved crawlers from scanning that file's content, but does it (also) stop the file being listed as a search result?
No: neither Google nor Bing will necessarily stop indexing a file just because it appears in robots.txt:
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.
https://developers.google.com/search/docs/advanced/robots/intro
It is important to understand that this does not by definition imply that a page that is not crawled also will not be indexed. To see how to prevent a page from being indexed, see this topic.
https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec
My site is undergoing a security audit to obtain a security certification. After the audit, they gave me two security issues to address.
Stored Cross Site Scripting: The application must implement server side validation for all user-entered inputs. Only expected values should be accepted. Script tags should be rejected. All user inputs should be sanitized.
Malicious File Upload
I have added the filter tags in Joomla's Global Configuration text filters. Also, even though I have clearly restricted all file upload elements to the .jpg, .jpeg, and .png extensions, I can still upload files with a .php extension.
How can we rectify these two issues?
Regards
Use the defines.php file to clean the POST data before it reaches the Joomla site, and block any request with $_FILES in it.
If your website needs to allow users to upload files, then make sure that these files are restricted to specific file types, and, if you don't need the outside world to have access to them, block access to the folder they are uploaded to using an .htaccess rule.
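The general idea of the extension check, sketched in Python purely for illustration (Joomla itself is PHP, and the function and allow-list below are hypothetical):

import os

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def is_allowed_upload(filename):
    # Only trust the client-supplied name for its extension,
    # and reject anything outside the allow-list.
    extension = os.path.splitext(filename)[1].lower()
    return extension in ALLOWED_EXTENSIONS

print(is_allowed_upload("avatar.png"))  # True
print(is_allowed_upload("shell.php"))   # False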
I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The instructions are to include these in the robots.txt file and place it into the root of my domain.
But I have an issue with this approach: anyone can go to www.mywebsite.com/robots.txt and see the contents, such as:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
which will tell anyone exactly which pages I don't want them to go to.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want to be exposed to the public: PayPal validation page for my software license payment. The page logic will not let a dud request through, but it wastes bandwidth (for PayPal connection, as well as for validation on my server) plus it logs a connection-attempt entry into the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and via the .php scripts on my server. The name of the page itself is something like /php/ipnius726.php, so it's not something simple that a crawler could just guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
Secure access to the functionality behind those URLs
Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
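As a minimal sketch of the session-token idea (the Flask route, token store, and parameter names here are hypothetical, not your actual setup):

from flask import Flask, abort, request

app = Flask(__name__)
VALID_TOKENS = {"example-token"}  # tokens you hand out to authorized users

@app.route("/php/ipnius726.php")
def protected_page():
    # Reject any request that doesn't carry a token you issued.
    token = request.args.get("token") or request.headers.get("X-Session-Token")
    if token not in VALID_TOKENS:
        abort(403)
    return "OK"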
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but just put hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
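A rough sketch of what watching for those honeypot requests might look like, assuming a combined-format access log at the hypothetical path access.log:

# Collect client IPs that requested the honeypot URL listed only in robots.txt.
suspect_ips = set()
with open("access.log") as log:
    for line in log:
        parts = line.split()
        if len(parts) > 6 and parts[6].startswith("/honeypot.php"):
            suspect_ips.add(parts[0])  # first field is the client IP
print(suspect_ips)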
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
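If you want to double-check which paths a rule like this covers, Python's standard urllib.robotparser module can evaluate the rules for you (the URLs below are just the examples from above plus one allowed path):

from urllib import robotparser

rules = ["User-agent: *", "Disallow: /php/ipn"]
parser = robotparser.RobotFileParser()
parser.parse(rules)

for url in ("http://example.com/php/ipnius726.php",
            "http://example.com/php/ipnfoobar",
            "http://example.com/php/other.php"):
    print(url, parser.can_fetch("*", url))
# The first two print False (blocked), the last prints True.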
This is to supplement David Underwood's and unor's answers (not having enough rep points, I am left with just answering the question). Recent digging shows that Google has a clause that allows it to ignore the previously respected robots.txt file, on top of other security concerns. The link is a blog post from Zac Gery explaining the new(er) policy and giving some simple explanations of how to "force" the Google search engine to be nice. I realize this isn't precisely what you are looking for, but on the QA and security side I have found it to be very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html
I'm not quite sure whether this is the right forum to post my question. I'm analyzing web server logs in both Apache and IIS log formats. I want to find evidence of automated browsing (e.g. web robots, spiders, bots, etc.). I used the Python package robot-detection 0.2.8 to detect robots in my log files. However, there may be other robots (automated programs) that have traversed the website which robot-detection cannot identify.
So are there any specific clues that can be found in log files (actions that software performs but human users do not, etc.)?
Do they follow a specific navigation pattern?
I saw some requests for favicon.ico. Does this indicate automated browsing?
I found this article with some valuable points.
The article on how to identify robots has some good information. Here are other things you might consider.
If you see a request for an HTML page, but it isn't followed by requests for the images or script files that the page uses, it's very likely that the request came from a crawler. If you see lots of those from the same IP address, it's almost certainly a crawler. It could be the Lynx browser (text only), but it's more likely a crawler.
It's pretty easy to spot a crawler that scans your entire site very quickly. But some crawlers go more slowly, waiting 5 minutes or more between page requests. If you see multiple requests from the same IP address, spread out over time but at very regular intervals, it's probably a crawler.
Repeated 403 (Forbidden) entries in the log from the same IP. It's rare that a human will suffer through more than a handful of 403 errors before giving up. An unsophisticated crawler will blindly try URLs on the site, even if it gets dozens of 403s.
Repeated 404s from the same IP address. Again, a human will give up after some small number of 404s. A crawler will blindly push on ... "I know there's a good URL in here somewhere."
A user-agent string that isn't one of the major browsers' agent strings. If the user-agent string doesn't look like a browser's user agent string, it's probably a bot. Note that the reverse isn't true; many bots set the user agent string to a known browser user agent string.
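A rough sketch of the repeated-404 heuristic, assuming a combined-format Apache access log at the hypothetical path access.log and an arbitrary threshold:

from collections import Counter

not_found = Counter()
with open("access.log") as log:
    for line in log:
        parts = line.split()
        # In the combined log format the status code is the 9th field.
        if len(parts) > 8 and parts[8] == "404":
            not_found[parts[0]] += 1  # first field is the client IP

for ip, count in not_found.most_common():
    if count >= 20:  # tune this threshold for your traffic
        print(ip, count)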
I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (in theory) to try to prevent unauthorized access by removing access for all search engine bots/crawlers entirely. Having Google index our site to make it searchable is pointless from a business perspective and just gives a hacker another way to find the website in the first place.
I know in the robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt or is it better done by .htaccess or other?
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.
If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
Header set X-Robots-Tag noindex,nofollow
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
<meta name="robots" content="noindex,nofollow" />
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
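If you want to confirm the header is actually being served, a quick check with a short Python script will do (assuming the requests package is installed and example.com stands in for your site):

import requests

# A HEAD request is enough to inspect the response headers.
response = requests.head("https://example.com/some-page")
print(response.headers.get("X-Robots-Tag"))  # expect "noindex,nofollow"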
In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.
This is best handled with a robots.txt file, though it only works for bots that respect the file.
To block the whole site add this to robots.txt in the root directory of your site:
User-agent: *
Disallow: /
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.
Below are the .htaccess rules (Apache 2.2 syntax; Apache 2.4 uses Require directives instead) to restrict everyone except people coming from your company IP:
Order allow,deny
# Enter your company's IP address here
Allow from 255.1.1.1
Deny from all
If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in someway to access your site.
That would mean that anyone (Google, a bot, or a person who stumbled upon a link) who isn't authenticated wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
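As a minimal sketch of what checking HTTP Basic credentials by hand could look like (a bare WSGI app with hypothetical credentials; in practice you would usually let the web server handle this, e.g. via an .htpasswd file):

import base64

def members_only_app(environ, start_response):
    header = environ.get("HTTP_AUTHORIZATION", "")
    if header.startswith("Basic "):
        decoded = base64.b64decode(header[6:]).decode("utf-8")
        username, _, password = decoded.partition(":")
        if (username, password) == ("employee", "secret"):  # hypothetical credentials
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"Welcome"]
    # Anyone without valid credentials gets a 401 and a login prompt.
    start_response("401 Unauthorized",
                   [("WWW-Authenticate", 'Basic realm="Members only"')])
    return [b"Authentication required"]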
In addition to the provided answers, you can stop search engines from crawling a specific page on your website via robots.txt. Below is an example:
User-agent: *
Disallow: /example-page/
The above example is especially handy when you have dynamic pages; otherwise, you may want to add the HTML meta tag below to the specific pages you want to keep out of search engines:
<meta name="robots" content="noindex, nofollow" />