How to get a domain un-indexed by search engines - .htaccess

I have a domain with a lot of indexed pages; I use it as an online test domain. I understand I should really be testing on an intranet or similar, but over time Google has indexed a few sites on it that are no longer relevant.
Does anyone know how to get a domain completely de-indexed from most search engines?

There are a couple of things you can do:
Set up a restrictive robots.txt file
Password-protect the domain root
Request removal directly from the search engines
If you have a static IP and you are the only one accessing the site, you can simply deny access to any IPs other than yours (a sketch follows below).
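A minimal .htaccess sketch of that IP-restriction approach, assuming Apache 2.2-style access directives (the 203.0.113.10 address is a placeholder for your own static IP):
# Deny everyone except your own static IP (placeholder address)
Order Deny,Allow
Deny from all
Allow from 203.0.113.10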

Place a robots.txt file in the root directory of your website. It can be used to control how much access search engine spiders have to your content: you can declare certain areas of your site off limits to indexing, on a directory-by-directory basis.
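For instance, a robots.txt along these lines (the directory names are placeholders) keeps compliant crawlers out of two directories while leaving the rest of the site crawlable:
User-agent: *
Disallow: /test-site-1/
Disallow: /old-demo/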

Remove any alias domains you have
Remove any URL redirects from the old domain to the new one
so that search engines can gradually de-index your old domain.

Related

Same website on same domain name with different extensions - i.e. .com and .co.uk

What is best practice for doing this? Should I have duplicate content at each domain, or should I redirect from one to the other, i.e. all traffic to the .co.uk domain redirected to the .com domain?
Best practice is to send them all to one web server.
By default the server will not care which domain is pointed at it and will show the same home page under whichever domain you used to reach it.
However there are two possible issues with this that come to mind:
The person who created the website hopefully only used relative links (the contact-us button points to contactus.htm instead of http://domainx.com/contactus.htm). If not, some links might switch the user from domainx.co.uk to domainx.com.
Search Engine Optimisation: it is better, SEO-wise, if all the links to your site point to one domain name rather than appearing to belong to several less popular sites.
You can get everyone onto the same domain by using a RewriteRule or a 301 redirect to the primary site, as sketched below. Alternatively, you could make every hyperlink on the site absolute and point it at the primary domain.
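A sketch of the RewriteRule approach, assuming Apache with mod_rewrite enabled and using the domainx.co.uk / domainx.com names from above as placeholders:
RewriteEngine On
# Permanently redirect every request on the .co.uk domain to the .com domain
RewriteCond %{HTTP_HOST} ^(www\.)?domainx\.co\.uk$ [NC]
RewriteRule ^(.*)$ http://www.domainx.com/$1 [R=301,L]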

Block Bots from crawling one of my sites on a multistore multidomain prestashop

Hello, I have a multistore, multidomain PrestaShop installation with main domain example.com, and I want to block all bots from crawling a subdomain, subdomain.example.com, made for resellers where they can buy at lower prices. Its content is a duplicate of the original site, and I am not exactly sure how to do it. Usually, if I wanted to block bots from a site, I would use
User-agent: *
Disallow: /
But how do I use it without hurting the whole store? And is it possible to block the bots from the .htaccess too?
Regarding your first question:
If you don't want search engines to gain access to the subdomain, using a robots.txt file served ON the subdomain itself (sub.example.com/robots.txt) is the way to go. Don't put those rules in your regular domain's robots.txt (example.com/robots.txt) - see the Robots.txt reference guide.
Additionally, I would verify both domains in Google Search Console. There you can monitor and control the indexation of the subdomain and main domain.
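Because a multistore installation typically serves both domains from the same document root, one way to get a separate robots.txt onto the subdomain is a host-based rewrite. A sketch, assuming Apache with mod_rewrite and a hypothetical robots_subdomain.txt file containing the restrictive rules:
RewriteEngine On
# Serve a more restrictive robots.txt only for the reseller subdomain
RewriteCond %{HTTP_HOST} ^subdomain\.example\.com$ [NC]
RewriteRule ^robots\.txt$ robots_subdomain.txt [L]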
Regarding your second question:
I've found a SO thread here which explains what you want to know: Block all bots/crawlers/spiders for a special directory with htaccess.
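As for doing it from .htaccess: one hedged option, assuming Apache with mod_setenvif and mod_headers enabled, is to send a noindex header only on requests that arrive via the reseller subdomain, leaving the main store untouched:
# Flag requests for the reseller subdomain and mark their responses as noindex
SetEnvIfNoCase Host "^subdomain\.example\.com$" reseller_site
Header set X-Robots-Tag "noindex, nofollow" env=reseller_site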
We use a canonical URL to tell the search engines where to find the original content.
https://yoast.com/rel-canonical/
A canonical URL allows you to tell search engines that certain similar URLs are actually one and the same. Sometimes you have products or content that is accessible under multiple URLs, or even on multiple websites. Using a canonical URL (an HTML link tag with attribute rel=canonical) these can exist without harming your rankings.
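For reference, the tag goes in the <head> of each duplicate page and points at the preferred URL (example.com/product-x/ is a placeholder here):
<link rel="canonical" href="https://example.com/product-x/" />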

Blocking Google (and other search engines) from crawling domain

We want to open a new domain for certain purposes (call them PR). The thing is we want the domain to point to the same website we currently have.
We do not want this new domain to appear on search engines (specifically Google) at all.
Options we've ruled out:
Robots.txt can't be used - it will work the same on both domains, which isn't what we want.
The rel=canonical doesn't block - only suggests to index a similar page instead. The original page might end up being indexed.
Is there a way to handle this?
EDIT
Regarding .htaccess suggestions: we're on IIS7.
rel=canonical is not a suggestion. It tells Google exactly which page to use.
Having said that, when serving pages on the domain you do not want indexed, you can use the X-Robots-Tag HTTP header to block those pages from being indexed:
Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP header used to serve the file.
Don't include this document in the Google search results:
X-Robots-Tag: noindex
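Since you're on IIS7, a sketch of one way to emit that header is a custom response header in web.config. This assumes the PR domain is bound to its own IIS site, because the header below is added to every response that site serves:
<configuration>
  <system.webServer>
    <httpProtocol>
      <customHeaders>
        <!-- Ask crawlers not to index anything served by this site -->
        <add name="X-Robots-Tag" value="noindex, nofollow" />
      </customHeaders>
    </httpProtocol>
  </system.webServer>
</configuration>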
Have you tried setting your preferred domain in Google Webmaster Tools?
The drawback to this approach is that it doesn't work for other search engines.
I would block them via, say, a .htaccess file at the root of the site on the domain in question.
BrowserMatchNoCase SpammerRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot
Where you'd have to specify the different bots used by the major search engines.
Or you could take the opposite approach and whitelist all known web browsers instead.
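A sketch of the first idea, matching the user-agent strings of the major crawlers with the same Apache 2.2-style directives as above (the list is illustrative, not exhaustive):
# Flag requests whose User-Agent matches a major search engine crawler and deny them
BrowserMatchNoCase "Googlebot|Bingbot|Slurp|DuckDuckBot|Baiduspider|YandexBot" bad_bot
Order Deny,Allow
Deny from env=bad_bot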

How to stop search engines from crawling the whole website?

I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (In Theory) to try and prevent unauthorized access by totally removing access to it by all search engine bots/crawlers. Having Google index our site to make it searchable is pointless from the business perspective and just adds another way for a hacker to find the website in the first place to try and hack it.
I know in the robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt or is it better done by .htaccess or other?
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.
If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
Header set X-Robots-Tag noindex,nofollow
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
<meta name="robots" content="noindex,nofollow" />
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.
This is best handled with a robots.txt file, though it only works for bots that respect the file.
To block the whole site add this to robots.txt in the root directory of your site:
User-agent: *
Disallow: /
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.
Below are .htaccess rules that restrict access to everyone except visitors coming from your company IP (note the Deny,Allow order, so the Allow line can override the blanket deny):
Order Deny,Allow
Deny from all
# Enter your company's IP address here
Allow from 255.1.1.1
If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in someway to access your site.
That would mean that anyone (google, bot, person-who-stumbled-upon-a-link) who isn't authenticated, wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
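A minimal .htaccess sketch of the HTTP Basic Authentication option (the /path/to/.htpasswd location is a placeholder; the password file itself is created separately with the htpasswd utility):
# Require a valid username and password for every request
AuthType Basic
AuthName "Members Only"
AuthUserFile /path/to/.htpasswd
Require valid-user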
In addition to the answers already provided, you can stop search engines from crawling/indexing a specific page on your website via robots.txt. Below is an example:
User-agent: *
Disallow: /example-page/
The above example is especially handy when you have dynamic pages. Otherwise, you may want to add the HTML meta tag below to the specific pages you want disallowed from search engines:
<meta name="robots" content="noindex, nofollow" />

Implications of not forwarding http:// to http://www

My company is running IIS and DNN (I'm not a server guy, so color me ignorant), and I've read previously that you should redirect either http://www.mydomain to http://mydomain or vice versa. Can anyone give me reasons to do this?
From what I understand, it's because search engines see those as two different 'sites' (Even when visiting one or the other, I can be logged into one but not the other).
I also heard it can be a duplicate content problem, which search engines dislike.
Just looking for some professional insight, will help me and others.
Thanks!
Redirecting makes your site more SEO-friendly. Search engine crawlers view the www and non-www addresses as two different URLs, which can leave your site with multiple, weaker rankings for the same content.
ScottGu describes the problem and how to go about fixing it in a blog post:
http://weblogs.asp.net/scottgu/archive/2010/04/20/tip-trick-fix-common-seo-problems-using-the-url-rewrite-extension.aspx
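For IIS, the fix from that post comes down to a URL Rewrite rule in web.config (a sketch, assuming the URL Rewrite module is installed, that the rule sits inside <system.webServer>, and that www.mydomain.com is the canonical host you settle on):
<rewrite>
  <rules>
    <rule name="Canonical host name" stopProcessing="true">
      <match url="(.*)" />
      <conditions>
        <!-- Only fire for requests that arrive without the www prefix -->
        <add input="{HTTP_HOST}" pattern="^mydomain\.com$" />
      </conditions>
      <action type="Redirect" url="http://www.mydomain.com/{R:1}" redirectType="Permanent" />
    </rule>
  </rules>
</rewrite>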
Although it's mostly for SEO, there is also a potential usability issue in that a user who logs in on www.domain.com may get cookies that only work on the www subdomain and will be forced to log in again if they ever follow a link to domain.com (without the www prefix).
In addition to the SEO friendliness, this also prevents some errors that can come up when both the www and non-www versions work.
For example, a user could log in on www.yourdomain.com and receive a cookie; if they later visit your site via yourdomain.com, that cookie would not apply there.
