I am storing my sitemaps in my web folder. I want web crawlers (Googlebot etc.) to be able to access the files, but I don't necessarily want all and sundry to have access to them.
For example, this site (stackoverflow.com), has a site index - as specified by its robots.txt file (https://stackoverflow.com/robots.txt).
However, when you type https://stackoverflow.com/sitemap.xml, you are directed to a 404 page.
How can I implement the same thing on my website?
I am running a LAMP website and using a sitemap index file (so I have multiple sitemaps for the site). I would like to use the same mechanism to make them unavailable via a browser, as described above.
First, decide which networks should be allowed to fetch your actual sitemap.
Second, configure your web server to serve the sitemap file to requests from those networks, and to return your 404 error page for all other requests.
For nginx, you're looking to put something like allow 10.10.10.0/24; followed by deny all; into a location block for the sitemap file.
For Apache, you're looking to use mod_authz_host inside a <Files> block for the sitemap file: the Allow directive on Apache 2.2, or Require ip on Apache 2.4 and later.
You can check the user-agent header the client sends, and only pass the sitemap to known search bots. However, this is not really safe since the user-agent header is easily spoofed.
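As a rough illustration of the User-Agent check described above, here is a minimal Python sketch. The token list and the function name looks_like_search_bot are assumptions made for this example, not an authoritative list of crawlers. Because the header is trivially spoofed, the result is best treated as a first-pass filter only.

```python
# Illustrative (not exhaustive) substrings found in well-known crawler User-Agent headers.
KNOWN_BOT_TOKENS = ("googlebot", "bingbot", "duckduckbot", "yandexbot")


def looks_like_search_bot(user_agent):
    """Return True if the User-Agent header contains a known crawler token.

    This only inspects a client-supplied string, so it can be faked easily.
    """
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)


if __name__ == "__main__":
    print(looks_like_search_bot(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    ))
```

A request that passes this check would still need an IP-based check before being trusted.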
Stack Overflow presumably checks two things when deciding who gets access to the sitemaps:
The USER_AGENT string
The originating IP address
Both will probably be matched against a database of known legitimate bots.
The USER_AGENT string is pretty easy to check in a server side language; it is also very easy to fake. More info:
For how to check the USER_AGENT string: "Way to tell bots from human visitors?"
For instructions on IP checking: Google Webmaster Central, "How to verify Googlebot"
Related: Allowing Google to bypass CAPTCHA verification - sensible or not?
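The IP check that Google documents is a reverse DNS lookup on the client address followed by a forward lookup on the resulting hostname. A minimal Python sketch of that idea follows. The function name is_verified_googlebot and the injectable lookup parameters are assumptions for this example (they make the logic testable without network access), and the accepted domain suffixes should be confirmed against Google's current documentation.

```python
import socket

# Assumption: hostnames of genuine Googlebot addresses end in one of these suffixes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse DNS record
    if not host.endswith(GOOGLE_SUFFIXES):
        return False  # hostname is not in a Google-owned domain
    try:
        # The forward lookup must map back to the original address.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

DNS lookups are slow, so in practice the result would be cached per IP address rather than computed on every request.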
I have in my main website root the file...
lib.php
Hackers keep hitting my website from different IP addresses, different operating systems, different everything. Each request is redirected to our 404 error page, and this 404 page tracks visitors using standard visitor analytics to allow us to see problems as they arise.
Below is an example of the landing pages requested by the hackers, as shown in analytics; I get about 200 such hits per hour. Each link is a bit different, as they are using a variable to set the page URL to go to.
mysite.com/lib.php?id=zh%2F78jQrm3qLoE53KZd2vBHtPFaYHTOvBijvL2NNWYE%3D
mysite.com/lib.php?id=WY%2FfNHaB2OBcAH0TcsAEPrmFy1uGMHgxmiWVqT2M6Wk%VD
mysite.com/lib.php?id=WY%2FfNHaB2OBcAH0TcsAEPrmFy1uGMHgxmiWVqJHGEWk%T%
mysite.com/lib.php?id=JY%2FfNHaB2OBcAH0TcsAEPrmFy1uGMHgxmiWVqT2MFGk%BD
I do not think I even need the file http://www.mysite.com/lib.php
Should I need it? When I visit mysite.com/lib.php it is redirected to my custom 404 page.
How can I best stop this? I am thinking of using .htaccess, but I am not sure of the best setup.
This is most probably part of the Asprox botnet.
http://rebsnippets.blogspot.cz/asprox
The key thing is to change your passwords and stop using the FTP protocol to access your privileged accounts.
I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The instructions are to include these in the robots.txt file and place it into the root of my domain.
But I have an issue with this approach: anyone can go to www.mywebsite.com/robots.txt and see its contents, such as:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
That tells anyone exactly which pages I don't want them to visit.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want to be exposed to the public: PayPal validation page for my software license payment. The page logic will not let a dud request through, but it wastes bandwidth (for PayPal connection, as well as for validation on my server) plus it logs a connection-attempt entry into the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and in .php scripts on my server. The name of the page itself is something like /php/ipnius726.php, so it's not something simple that a crawler can just guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
1. Secure access to the functionality behind those URLs
2. Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
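One way to picture the "token given to authorized users" idea is a value signed with a server-side secret, which the protected page verifies before doing any work. The following Python sketch is only an illustration under that assumption: SECRET_KEY, issue_token and verify_token are hypothetical names, and a real deployment would also want expiry and proper session handling.

```python
import hashlib
import hmac

# Hypothetical server-side secret; in practice, load a long random value from config.
SECRET_KEY = b"replace-with-a-long-random-secret"


def _sign(user_id):
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(user_id):
    """Return 'user_id.signature' for an already-authorized user."""
    return f"{user_id}.{_sign(user_id)}"


def verify_token(token):
    """Check that the signature matches the user id carried in the token."""
    user_id, sep, signature = token.rpartition(".")
    if not sep:
        return False  # malformed token, no separator present
    # Constant-time comparison on bytes avoids timing leaks.
    return hmac.compare_digest(signature.encode("utf-8"), _sign(user_id).encode("utf-8"))
```

A request without a valid token can then be rejected cheaply, before any database logging or outbound connection takes place.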
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but list only hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
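A minimal Python sketch of the honeypot idea, assuming the web server writes access logs in the common or combined log format: it scans log lines and collects the client addresses that requested the trap URL. The function name honeypot_hits and the sample addresses are made up for this example.

```python
import re

# Matches the start of a common/combined log format line:
# client address, two ignored fields, [timestamp], "METHOD /path ..."
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>[A-Z]+) (?P<path>\S+)')


def honeypot_hits(log_lines, trap_path="/honeypot.php"):
    """Return the set of client addresses that requested the trap path."""
    offenders = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in an unexpected format
        path = match.group("path").split("?", 1)[0]  # ignore any query string
        if path == trap_path:
            offenders.add(match.group("ip"))
    return offenders
```

The resulting set could feed a blacklist, bearing in mind that IP addresses are often shared or rotated.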
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
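The prefix matching can be checked locally with Python's standard urllib.robotparser module, which applies the same "path starts with" rule. This is just a quick sketch for testing a robots.txt before deploying it; the helper name is_blocked is an assumption for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /php/ipn
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_blocked(url, user_agent="*"):
    """Return True if the parsed robots.txt disallows this URL for the user agent."""
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/php/ipnius726.php",
        "http://example.com/php/ipnfoobar",
        "http://example.com/php/index.php",
    ):
        print(url, "blocked" if is_blocked(url) else "allowed")
```

Only the first two URLs are reported as blocked, since /php/index.php does not start with /php/ipn.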
This is to supplement David Underwood's and unor's answers (not having enough rep points I am left with just answering the question). Recent digging is showing that Google has a clause that allows them to ignore the previously respected robots file on top of other security concerns. The link is a blog from Zac Gery explaining the new(er) policy and some simple explanations of how to "force" Google search engine to be nice. I realize this isn't precisely what you are looking for but on the QA and security side, I have found it to be very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html
I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (In Theory) to try and prevent unauthorized access by totally removing access to it by all search engine bots/crawlers. Having Google index our site to make it searchable is pointless from the business perspective and just adds another way for a hacker to find the website in the first place to try and hack it.
I know in the robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt or is it better done by .htaccess or other?
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.
If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
Header set X-Robots-Tag noindex,nofollow
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
<meta name="robots" content="noindex,nofollow" />
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.
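To confirm that a page really carries the noindex directive, either as an X-Robots-Tag header or as a robots meta tag, a small check along the following lines can help. It is a sketch using only the Python standard library; the function name is_noindex is an assumption, and it expects the response headers as a dict and the body as a string that have already been fetched by some other means.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def is_noindex(headers, html):
    """Return True if the header or the HTML meta tag asks robots not to index."""
    directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives.update(d.strip().lower() for d in value.split(",") if d.strip())
    meta_parser = _RobotsMetaParser()
    meta_parser.feed(html)
    return "noindex" in (directives | meta_parser.directives)
```

Running this against a handful of representative URLs, including non-HTML files such as images, shows whether the header is applied everywhere it is expected.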
For bots that respect it, this is best handled with a robots.txt file.
To block the whole site add this to robots.txt in the root directory of your site:
User-agent: *
Disallow: /
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.
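Below are .htaccess rules (Apache 2.2 syntax) that block everyone except requests from your company's IP address:
# With "deny,allow", Deny rules are evaluated first and Allow rules override them
Order deny,allow
Deny from all
# Enter your company's IP address here (203.0.113.10 is a placeholder)
Allow from 203.0.113.10
# On Apache 2.4+, replace the three directives above with: Require ip 203.0.113.10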
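If the web server configuration cannot be changed, the same allow-by-IP rule can be approximated in application code. The Python sketch below uses the standard ipaddress module; the network 203.0.113.0/24 is a documentation placeholder and is_allowed is a hypothetical helper name, so substitute your company's real address range.

```python
import ipaddress

# Placeholder range; replace with the network(s) your company actually uses.
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_allowed(client_ip):
    """Return True only if the client address falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # not a valid IP address at all
    return any(address in network for network in ALLOWED_NETWORKS)
```

Behind a proxy or load balancer, the address seen by the application may be the proxy's, so the value passed in needs to come from a trusted source.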
If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.
That would mean that anyone (google, bot, person-who-stumbled-upon-a-link) who isn't authenticated, wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
In addition to the provided answers, you can stop search engines from crawling/indexing a specific page on your website in robots.txt. Below is an example:
User-agent: *
Disallow: /example-page/
The above example is especially handy when you have dynamic pages. Otherwise, you may want to add the HTML meta tag below to the specific pages you want kept out of search engines:
<meta name="robots" content="noindex, nofollow" />
I want to implement HTTPS on only a selection of my web pages. I have purchased my SSL certificates and got them working. However, due to speed demands, I cannot afford to use HTTPS on every single page.
Instead, I want my server to serve HTTP or HTTPS depending on the page being viewed. An example where this has been done is '99designs'.
The problem in slightly more detail:
When my visitors first visit my site, they only have access to non-sensitive information, and therefore I want them to be served plain HTTP.
Then, once they log in, they are granted access to more sensitive information, e.g. profile information, which should be delivered over HTTPS.
Despite being logged in, if the user goes back to a non-sensitive page such as the homepage, then I want it delivered using HTTP.
One common solution seems to be using the .htaccess file. The problem is that my site is relatively large, meaning that this would require me to write a rule for every page (several hundred) to determine whether it should be served using HTTP or HTTPS.
And then there is the problem of defining user generated content pages.
Please help,
Many thanks,
David
You've not mentioned anything about the architecture you are using. Assuming that SSL termination is on the web server, you should set up separate virtual hosts with completely separate and non-overlapping document trees and, for preference, use a path scheme which does not overlap (to avoid little accidents).
A mirroring CDN can't have the same hostname as your application server, because you need a way for the CDN to explicitly reference the application.
Why, in general, do sites like Facebook run their CDN on a totally separate host, not just a subdomain like cdn.facebook.com? Example: http://profile.ak.fbcdn.net/hprofile-ak-snc4/173706_6103645_790537_q.jpg
Is the reason that they can construct resource URLs with many different hostnames, to avoid the 4-connections-per-host limit in some browsers?
If your domain is www.example.org, you can host your static components on static.example.org. However, if you've already set cookies on the top-level domain example.org as opposed to www.example.org, then all the requests to static.example.org will include those cookies.
From: http://developer.yahoo.com/performance/rules.html#cookie_free
Because user generated content can contain nasties that may be able to access data hosted on the primary domain.
It also stops things like cookies and authentication getting sent in the request to CDN content.
"Preventing users from inserting scripts, and at the same time allowing user submitted html is extremely difficult to do on the server side - ergo we must have sandboxing."
Borrowed from a fairly old whatwg post