Prevent indexing a domain in search engines like Google and Bing - .htaccess

I have a domain (e.g. domain.com) that is public for all users, and I have a secret sub-domain (e.g. site1.secretdomain.com) of a general domain (here secretdomain.com) that is just for the site's admins.
I don't want Google or other search engines to index the secret domain or its subdomains. Do you have any ideas? I think robots.txt won't work, because it would affect all of my domains.

A not-so-foolproof solution is to remove, or add a nofollow directive to, any references to the subdomain's pages, along with the other necessary changes in robots.txt.
Another option, a little more expensive but more concrete on a pragmatic note, would be to look into a CAPTCHA such as Google's reCAPTCHA.
On a more theoretical note, without much research, I'd guess a typical approach to the problem would be to serve a unique cryptographic challenge (some computationally expensive problem) on each request and use the solution to validate a session for the user.
Even the most advanced crawlers work with a limited JavaScript execution budget, and will decide to move on to other pages once it is exhausted. Find a suitable challenge, optimize the page design to factor in the load delay, and you have a subdomain open to all humans but not to bots.
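To make that last idea a bit more concrete, here is a rough server-side sketch in Python (all names and the difficulty value are illustrative; the client-side JavaScript that actually grinds out the answer is omitted). On a successful validation you would set a session cookie so a human pays the cost only once per visit.

import hashlib
import os

DIFFICULTY = 20  # leading zero bits the client must produce (illustrative value)

def issue_challenge():
    # Nonce sent to the browser; the page's JavaScript must find a suffix
    # such that sha256(nonce + suffix) starts with DIFFICULTY zero bits.
    return os.urandom(16).hex()

def leading_zero_bits(digest):
    bits = bin(int.from_bytes(digest, "big"))[2:].zfill(len(digest) * 8)
    return len(bits) - len(bits.lstrip("0"))

def validate(nonce, suffix):
    # Cheap for the server to verify, expensive for the client to produce.
    digest = hashlib.sha256((nonce + suffix).encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY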

Related

Enabling SSL for a subdomain in IIS

I recently bought SSL for my website and want to create a section within the site in the form of https://secure.example.com/member/upgrade.aspx. However, I am having a hard time solving this because my website's URL rewrite currently prohibits any subdomain, and the user is logged out if he or she gets transferred to the link above.
I have searched online and found some good information, such as dynamically creating the URL without actually creating a subdomain in IIS.
Questions:
What steps are needed to achieve the objective above?
Should I have bought the wildcard certificate instead of one for a specific subdomain?
Thank you.
One option would be ignoring that URL pattern for rewrite purposes, or ignoring the URL if the protocol is HTTPS. That said, I would take a slightly different approach here and just put the entire site behind SSL -- rewriting all the queries to the other protocol works, and Google is now giving ranking bumps to HTTPS, so there are good business reasons to make the switch. You are already taking the pain of getting SSL involved at all -- the dedicated IP and certificate cost the same whether you use them on a single page or on all the pages, so you might as well take advantage of it and ease your management burden in the same motion.
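For example, if you do move the whole site to HTTPS on IIS, a rough sketch of a URL Rewrite rule for redirecting plain HTTP to HTTPS might look like the following (this assumes the URL Rewrite module is installed, goes inside <system.webServer> in web.config, and may need adjusting to coexist with your existing rules):

<rewrite>
  <rules>
    <rule name="Redirect to HTTPS" stopProcessing="true">
      <match url="(.*)" />
      <conditions>
        <add input="{HTTPS}" pattern="off" ignoreCase="true" />
      </conditions>
      <action type="Redirect" url="https://{HTTP_HOST}/{R:1}" redirectType="Permanent" />
    </rule>
  </rules>
</rewrite>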

SSL with CartThrob - in-template redirect or htaccess on the basis of URL segment?

this is a broader question than I would probably ask of the CartThrob folks, which is why I'm posting it here. What would the community recommend as far as SSL is concerned with CartThrob? The store functions are limited to a couple of key template groups. So my thinking was perhaps the best way to handle it would be htaccess on the basis of the presence of those URL segments. I would like to return the user to a non-SSL connection when they are not in the store area. So a trigger might be the first segment being "basket" or "account" for example. Or what about an in-template redirect to the secure URL? Very interested to hear the community's suggestions on how best to handle SSL within a given area of an EE site. I'm interested in whatever makes the most sense to implement, while also ensuring that, for example, all assets - even those loaded with path variables - are loaded via SSL. Thanks all!
I've always used CartThrob's https_redirect tag (docs) on my checkout screens, which will rewrite your {path}, {permalink} (etc)-created URLs to use https, as well as redirect you to the https:// version of your page if necessary.
That, combined with using the protocol-agnostic style of calling scripts and stylesheets should get you most of the way in getting your secure icon in the browser.
(Example:)
<script src="//ajax.googleapis.com/ajax/libs/jqueryui/1.9.1/jquery-ui.min.js"></script>

How to stop search engines from crawling the whole website?

I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (In Theory) to try and prevent unauthorized access by totally removing access to it by all search engine bots/crawlers. Having Google index our site to make it searchable is pointless from the business perspective and just adds another way for a hacker to find the website in the first place to try and hack it.
I know in the robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt or is it better done by .htaccess or other?
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.
If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
Header set X-Robots-Tag noindex,nofollow
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
<meta name="robots" content="noindex,nofollow" />
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.
This is best handled with a robots.txt file, though it only works for bots that respect the file.
To block the whole site add this to robots.txt in the root directory of your site:
User-agent: *
Disallow: /
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.
Below are the .htaccess rules to restrict everyone except requests from your company's IP:
Order deny,allow
Deny from all
# Enter your company's IP address here
Allow from 255.1.1.1
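(Note that Order/Allow/Deny is the older Apache 2.2 syntax; if you are on Apache 2.4, the equivalent, assuming mod_authz_core, would be the Require directive:)
# Apache 2.4 equivalent
Require ip 255.1.1.1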
If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.
That would mean that anyone (Google, bot, person-who-stumbled-upon-a-link) who isn't authenticated wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
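For example, a minimal .htaccess sketch of HTTP Basic Authentication (assuming Apache; the password file path is hypothetical and would be created with the htpasswd utility):
AuthType Basic
AuthName "Restricted area"
# Hypothetical path; create it with: htpasswd -c /path/to/.htpasswd username
AuthUserFile /path/to/.htpasswd
Require valid-user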
In addition to the provided answers, you can stop search engines from crawling/indexing a specific page on your website in robots.txt. Below is an example:
User-agent: *
Disallow: /example-page/
The above example is especially handy when you have dynamic pages; otherwise, you may want to add the HTML meta tag below to the specific pages you want disallowed from search engines:
<meta name="robots" content="noindex, nofollow" />

Summary of malformed URL's?

I am working on an IIS HTTP module whose purpose is to block the various common malformed URLs that can be used to attack my site.
Are there any good references for what kinds of URLs to look out for?
I know there is the URLScan project, but I want to understand the various attack vectors.
While you are asking for a blacklist, a more valuable approach is using a whitelist, i.e., think about which URLs are valid. This will minimize the chances of missing a specific URL pattern that can be used in a malicious fashion, but that is not caught by your blacklist pattern(s).
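For instance, a minimal sketch of the whitelist idea in Python (the patterns are purely hypothetical; the real list depends on what your application actually serves):

import re

# Hypothetical whitelist: only the URL shapes the application is known to serve.
ALLOWED_PATTERNS = [
    re.compile(r"^/$"),                                 # home page
    re.compile(r"^/articles/\d+$"),                     # numeric article ids
    re.compile(r"^/static/[\w.-]+\.(?:css|js|png)$"),   # known asset types
]

def is_allowed(path):
    # Reject anything that does not match a known-good pattern.
    return any(p.match(path) for p in ALLOWED_PATTERNS)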

How to best normalize URLs

I'm creating a site that allows users to add Keyword --> URL links. I want multiple users to be able to link to the same url (exactly the same, same object instance).
So if user 1 types in "http://www.facebook.com/index.php" and user 2 types in "http://facebook.com" and user 3 types in "www.facebook.com" how do I best "convert" them to what these all resolve to: "http://www.facebook.com/"
The back end is in Python...
How does a search engine keep track of URLs? Do they keep a URL and then take whatever it resolves to, or do they toss URLs that differ from what they resolve to and just care about the resolved version?
Thanks!!!
You'd resolve user 3 by fixing up invalid URLs. www.facebook.com isn't a URL, but you can guess that http:// should go on the start. An empty path part is the same as the / path, so you can be sure that needs to go on the end too. A good URL parser should be able to do this bit.
You could resolve user 2 by making an HTTP HEAD request to the URL. If it comes back with a status code of 301, you've got a permanent redirect to the real URL in the Location response header. Facebook does this to send facebook.com traffic to www.facebook.com, and it's definitely something sites should be doing (even though in the real world many aren't). You might consider allowing other redirect status codes in the 3xx family to do the same; it's not really the right thing to do, but some sites use 302 instead of 301 for the redirect because they're a bit thick.
If you have the time and network resources (plus more code to prevent the feature being abused to DoS you or others), you could also consider GETting the target web page and parsing it (assuming it turns out to be HTML). If there is a <link rel="canonical" href="..." /> element in the page, you should also treat that URL as being the proper one. (View Source: Stack Overflow does this.)
However, unfortunately, user 1's case cannot be resolved. Facebook is serving a page at / and a page at /index.php, and though we can look at them and say they're the same, there is no technical method to describe that relationship. In an ideal world Facebook would include either a 301 redirect response or a <link rel="canonical" /> to tell people that / was the proper format URL to access a particular resource rather than /index.php (or vice versa). But they don't, and in fact most database-driven web sites don't do this yet either.
To get around this, some search engines(*) compare the content at different [sub]domains, and to a limited extent also different paths on the same host, and guess that they're the same if the content is sufficiently similar. Of course this is a lot of work, requires a lot of storage and processing, and is ultimately not terribly reliable.
I wouldn't really bother with much of this, beyond fixing up URLs like in the user 3 case. From your description it doesn't seem that essential that pages that “are the same” have to share actual identity, unless there's a particular use-case you haven't mentioned.
(*: well, Google anyway; more traditional engines didn't, and would happily serve up multiple links for the same page, but I'd assume the other majors are doing something similar now.)
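Putting the first three steps above together, a rough Python sketch (using the third-party requests library; the canonical-link parse is deliberately naive, and a real implementation would use an HTML parser):

import re
import requests
from urllib.parse import urlparse, urlunparse, urljoin

def fix_up(raw):
    # Step 1: repair obviously invalid input (missing scheme, empty path).
    if "://" not in raw:
        raw = "http://" + raw              # guess a scheme for bare hostnames
    parts = urlparse(raw)
    path = parts.path or "/"               # an empty path is equivalent to "/"
    return urlunparse((parts.scheme, parts.netloc, path, "", parts.query, ""))

def follow_permanent_redirects(url, max_hops=5):
    # Step 2: let 301 responses tell us the site's preferred URL.
    for _ in range(max_hops):
        resp = requests.head(url, allow_redirects=False, timeout=5)
        if resp.status_code == 301 and "Location" in resp.headers:
            url = urljoin(url, resp.headers["Location"])
        else:
            break
    return url

def canonical_link(url):
    # Step 3: honour <link rel="canonical"> if the page declares one.
    html = requests.get(url, timeout=5).text
    m = re.search(r'<link[^>]*rel=["\']canonical["\'][^>]*href=["\']([^"\']+)', html, re.I)
    return urljoin(url, m.group(1)) if m else url

def normalize(raw):
    return canonical_link(follow_permanent_redirects(fix_up(raw)))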
There's no way to know, other than "magic" knowledge about the particular website, that "/index.php" is the same as fetching "/".
So, your problem, as stated, is impossible.
I'd save the 3 links separately, since you can never reliably tell that they resolve to the same page. It all depends on how the server (out of our control) resolves the URL.

Resources