Stop Google from indexing [closed] - meta-tags

Is there a way to stop Google from indexing a site?

robots.txt
User-agent: *
Disallow: /
This asks all compliant search bots not to crawl the site. Note that blocking crawling is not quite the same as blocking indexing.
For more info see:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360

Remember that preventing Google from crawling doesn't mean you can keep your content private.
My answer is based on few sources: https://developers.google.com/webmasters/control-crawl-index/docs/getting_started
https://sites.google.com/site/webmasterhelpforum/en/faq--crawling--indexing---ranking
The robots.txt file controls crawling, but not indexing! Those are two different actions, performed separately. Some pages may be crawled but not indexed, and some may even be indexed but never crawled: a link to a non-crawled page can exist on other websites, which leads Google's indexer to discover it and try to index it.
The question is about indexing, which is gathering data about the page so that it may be available through search results. Indexing can be blocked by adding a meta tag:
<meta name="robots" content="noindex" />
or adding HTTP header to response:
X-Robots-Tag: noindex
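For illustration, here is a minimal Python sketch (not from the original answer; the handler name and the way you would run it are my own) that serves every page with this header, using only the standard library:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    """Serve every response with the header equivalent of the noindex meta tag."""

    def do_GET(self):
        body = b"<html><body>not for search results</body></html>"
        self.send_response(200)
        # The HTTP-header counterpart of <meta name="robots" content="noindex" />
        self.send_header("X-Robots-Tag", "noindex")
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging; not relevant to the example.
        pass

# To run: HTTPServer(("127.0.0.1", 8080), NoIndexHandler).serve_forever()
```

The header has the advantage over the meta tag that it also covers non-HTML resources such as images and PDFs.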
If the question is about crawling, then of course you could create a robots.txt file with the following lines:
User-agent: *
Disallow: /
Crawling is an action performed to gather information about the structure of one specific website. For example, you've added the site through Google Webmaster Tools. The crawler will take note of it and visit your website, looking for robots.txt. If it doesn't find one, it will assume it can crawl anything (it's very important to have a sitemap.xml file as well, to help in this operation and to specify priorities and change frequencies). If it finds the file, it will follow its rules. After successful crawling it will, at some point, run indexing for the crawled pages, but you can't tell when.
Important: this all means that your page can still be shown in Google search results regardless of robots.txt.
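The crawl-rule behaviour described above can be checked with Python's standard-library robots.txt parser (an illustration, not part of the original answer):

```python
from urllib import robotparser

# The blanket Disallow rule shown above.
rules = ["User-agent: *", "Disallow: /"]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# A compliant crawler asks before fetching any URL; with "Disallow: /"
# the answer is no for every path and every user agent.
print(parser.can_fetch("Googlebot", "https://example.com/any/page"))  # False
```

This also makes the crawling/indexing distinction concrete: the parser only answers "may I fetch this URL?", and says nothing about whether the URL may appear in search results.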

There are several ways to stop crawlers, including Google's, from crawling and indexing your website.
At server level through header
Header set X-Robots-Tag "noindex, nofollow"
At root domain level through robots.txt file
User-agent: *
Disallow: /
At page level through robots meta tag
<meta name="robots" content="noindex, nofollow" />
However, I must say that if your website has outdated or no-longer-existing pages/URLs, you should wait for some time: Google will automatically deindex those URLs in the next crawl. Read https://support.google.com/webmasters/answer/1663419?hl=en

You can disable this server-wide by adding the setting below globally in the Apache conf; the same parameter can be used in a vhost to disable it for that particular vhost only.
Header set X-Robots-Tag "noindex, nofollow"
Once this is done, you can test it by verifying the headers Apache returns:
curl -I staging.mywebsite.com
HTTP/1.1 302 Found
Date: Sat, 26 Nov 2016 22:36:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Location: /pages/
X-Robots-Tag: noindex, nofollow
Content-Type: text/html; charset=UTF-8
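The same check can be scripted; a small Python sketch (the helper name is my own, standard library only) that mirrors the curl test above:

```python
import urllib.request

def has_noindex(url):
    """Return True if the response carries an X-Robots-Tag containing 'noindex'."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        tag = response.headers.get("X-Robots-Tag", "")
    return "noindex" in tag.lower()
```

Note that urllib follows redirects by default, so for a 302 response like the one above it reports the header of the final destination, not of the redirect itself.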

Bear in mind that Microsoft's crawler for Bing, despite their claim to obey robots.txt, does not always do so.
Our server stats indicate that they have a number of IPs running crawlers that do not obey robots.txt, as well as a number that do.

I use a simple aspx page that relays results from Google to my browser using a fake 'PREF' cookie so it gets 100 results at a time. I didn't want Google to see this relay page, so I check the IP address, and if it starts with 66.249 then I simply do a redirect.
Another trick I use is some JavaScript that calls a page to set a flag in the session, because most (NOT ALL) web bots don't execute JavaScript; so you know the client is either a browser with JavaScript turned off or, more likely, a bot.
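The IP check described above amounts to a one-line prefix test. A sketch (the 66.249 range is the one mentioned in the answer; Google publishes full, current ranges, so treat this as illustrative rather than exhaustive):

```python
GOOGLEBOT_PREFIX = "66.249."  # range mentioned above; not a complete list

def looks_like_googlebot(ip):
    """Crude check used by the relay-page trick: match the address prefix.

    The trailing dot in the prefix avoids false matches such as 66.2491.x.x.
    """
    return ip.startswith(GOOGLEBOT_PREFIX)
```

A more robust approach is reverse-DNS verification of the crawler's hostname, since IP ranges change over time.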

You can also add the robots meta tag this way:
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
And as an extra layer you can modify .htaccess, but you need to test it carefully.

Use a nofollow robots meta tag:
<meta name="robots" content="nofollow" />
To specify nofollow at the link level, add the attribute rel with the value nofollow to the link:
<a href="example.html" rel="nofollow">example</a>
Note that nofollow only asks crawlers not to follow links; to keep a page itself out of the index you need noindex.

Is there a way to stop Google from indexing a site?
To stop Google from indexing the site, add the following meta tag to the head of every page:
<meta name="googlebot" content="noindex, nofollow">

Related

Disallow In-Page Url Crawls

I want to disallow all bots from crawling specific types of pages. I know this can be done via robots.txt as well as .htaccess. However, these pages are generated from the database from the user's request. I have searched the internet and could not find a good answer for doing so.
My link looks like:
http://www.my_website/some_controller/some_action/download?id=<encrypted_id>
There is a view page for the users wherein all the data that is displayed comes from the database including the kind of links that I have mentioned before. I want to hide those links from the bots and not the entire page. How can I do that?
Could the page not be generated with a
<meta name="robots" content="noindex">
in the head?
You cannot hide stuff from bots while making it available to other traffic; after all, how do you distinguish between a bot and regular traffic? You can't, without some sort of verification like a CAPTCHA (those pictures of a word you type into a box).
robots.txt does not stop bots. Most bots will look at it and stop out of their own choice, but only because they are programmed to do so. They do not have to, and can therefore ignore robots.txt completely.

Security concerns using robots.txt

I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The instructions are to include these in the robots.txt file and place it into the root of my domain.
But I have an issue with such approach, mainly, anyone can go to www.mywebsite.com/robots.txt and see the results as such:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
that will tell anyone the pages I don't want anyone to go to.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want to be exposed to the public: PayPal validation page for my software license payment. The page logic will not let a dud request through, but it wastes bandwidth (for PayPal connection, as well as for validation on my server) plus it logs a connection-attempt entry into the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere but with the PayPal and via .php scripts on my server. The name of the page itself is something like: /php/ipnius726.php so it's not something simple that a crawler can just guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
Secure access to the functionality behind those URLs
Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but just put hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
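A sketch of the log side of that honeypot trick. The file name and log layout are assumptions (honeypot.php as in the example above, and the common Apache access-log format):

```python
# Any client requesting a path that appears only in robots.txt must have
# read robots.txt and deliberately ignored the Disallow rule.
HONEYPOT_PATH = "/honeypot.php"  # assumed name, as in the trick above

def dishonest_clients(log_lines):
    """Return the set of client IPs that requested the honeypot URL.

    Assumes common log format:
    '<ip> - - [date] "GET /path HTTP/1.1" <status> <bytes>'
    where the request path is the 7th whitespace-separated field.
    """
    bad = set()
    for line in log_lines:
        parts = line.split()
        if len(parts) > 6 and parts[6] == HONEYPOT_PATH:
            bad.add(parts[0])
    return bad
```

The resulting IPs (or their User-Agent strings, if you parse those too) can then be fed into a blocklist.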
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
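Python's standard-library parser applies the same prefix matching, which makes the rule easy to sanity-check (illustrative, not from the original answer):

```python
from urllib import robotparser

# The incomplete-path rule from the answer above.
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /php/ipn"])

# Every path sharing the /php/ipn prefix is disallowed...
for url in ("http://example.com/php/ipn",
            "http://example.com/php/ipnfoobar",
            "http://example.com/php/ipnius726.php"):
    print(url, rp.can_fetch("*", url))  # False for each

# ...while other paths under /php/ remain crawlable.
print(rp.can_fetch("*", "http://example.com/php/other.php"))  # True
```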
This is to supplement David Underwood's and unor's answers (not having enough rep points, I am left with just answering the question). Recent digging shows that Google has a clause that allows them to ignore the previously respected robots file, on top of other security concerns. The link is a blog post from Zac Gery explaining the new(er) policy, with some simple explanations of how to "force" the Google search engine to be nice. I realize this isn't precisely what you are looking for, but on the QA and security side I have found it very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html

Blocking Google (and other search engines) from crawling domain

We want to open a new domain for certain purposes (call them PR). The thing is we want the domain to point to the same website we currently have.
We do not want this new domain to appear on search engines (specifically Google) at all.
Options we've ruled out:
Robots.txt can't be used - it will work the same on both domains, which isn't what we want.
The rel=canonical doesn't block - only suggests to index a similar page instead. The original page might end up being indexed.
Is there a way to handle this?
EDIT
Regarding .htaccess suggestions: we're on IIS7.
rel=canonical is not a suggestion. It tells Google exactly which page to use.
Having said that, when serving pages on the domain you do not want indexed, you can use the X-Robots-Tag header to block those pages from being indexed:
Simply add any supported META tag value as a new X-Robots-Tag directive in the HTTP header used to serve the file.
Don't include this document in the Google search results:
X-Robots-Tag: noindex
Have you tried setting your preferred domain in Google Webmaster Tools?
The drawback to this approach is that it doesn't work for other search engines.
I would block it via, say, a .htaccess file at the root of the site on the domain in question.
BrowserMatchNoCase SpammerRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot
Where you'd have to specify the different bots used by the major search engines.
Or you could allow all known webbrowsers and white list them instead.

SEO - Getting a 301 page indexed by search engines

I have a site (say site1.com) which 301-redirects to another page on a different site (say http://site2.com/some/dirty/url).
Typical code at site1.com:
<?php
header("HTTP/1.1 301");
header("refresh:0;url=http://site2.com/some/dirty/url");
?><html>
<head>
<title>
Site 1 - heading.
</title>
<meta name="description" content="some description" />
</head>
<body />
</html>
Typically, search engines never index site1.com, even when there are external links like:
<a href="http://site1.com">Click Here</a>
But this is considered as an external link to http://site2.com/some/dirty/url and thus http://site2.com/some/dirty/url is seo'd.
I somehow want to get site1.com indexed (just the title, meta description and URL), though http://site2.com/some/dirty/url getting indexed is not a problem. Is this really possible, or is it just something I have to forget about?
The 301 redirect tells search engines, and any other user agent that respects HTTP status codes, that http://site.com no longer exists and has moved to a new location. This means they now consider the new location of http://site.com to be http://site2.com/some/dirty/url and to associate everything, including all links to http://site.com to be associated with http://site2.com/some/dirty/url. So basically http://site.com does not exist anymore and no matter how many links you point to it, it won't change anything since they now will be associated with http://site2.com/some/dirty/url. And that makes sense since a 301 HTTP status does indicate that a page has moved permanently. If that page hasn't moved permanently then you are using the wrong HTTP status code.
Yes, it can be indexed, but it requires better on-page work on both of your sites (http://site.com and http://site2.com/some/dirty/url).
For example, I recently worked under the same conditions: the website "http://www.top-alliance.de" redirects to "http://www.top-alliance.com", and both sites were indexed by the search engine as of 04 June 2012. This happened because I did better on-page work for both pages.
So the conclusion is that both your sites will require better on-page work; then they will definitely be indexed by the search engine.
Thanks & Regards
Nitin Bhatnagar
To easily create redirects in WordPress, an alternative is a simple 301-redirect plugin. Once you've installed and activated the plugin, it adds a new menu in the Settings area of your dashboard.
There is really nothing to worry about with this plugin. The 301 Redirect configuration window shows you two simple fields, one labeled as the request and the other as the destination: basically the old permalink structure and the new permalink structure. You only need to add the part after your domain name in these fields.
In the example above, the request field holds the WordPress month-and-name permalink setting, while the destination field holds the post-name permalink setting. After you fill in these two fields, save your changes. Any search engine traffic arriving at the old links will then be redirected to the new ones.

How to stop search engines from crawling the whole website?

I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (In Theory) to try and prevent unauthorized access by totally removing access to it by all search engine bots/crawlers. Having Google index our site to make it searchable is pointless from the business perspective and just adds another way for a hacker to find the website in the first place to try and hack it.
I know in the robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt or is it better done by .htaccess or other?
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.
If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
Header set X-Robots-Tag noindex,nofollow
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
<meta name="robots" content="noindex,nofollow" />
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.
It is best handled with a robots.txt file, though only bots that respect the file will obey it.
To block the whole site add this to robots.txt in the root directory of your site:
User-agent: *
Disallow: /
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.
Below are the .htaccess rules to restrict everyone except your people from your company IP:
Order allow,deny
# Enter your company's IP address here
Allow from 255.1.1.1
Deny from all
If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in someway to access your site.
That would mean that anyone (google, bot, person-who-stumbled-upon-a-link) who isn't authenticated, wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
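A sketch of the server side of HTTP Basic Authentication (the user table is a stand-in; real deployments store salted password hashes, not plaintext):

```python
import base64

# Hypothetical credential store for illustration only.
USERS = {"employee": "s3cret"}

def is_authorized(auth_header):
    """Check an 'Authorization: Basic <base64(user:password)>' header value."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[6:]).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        # Malformed base64 or non-UTF-8 payload: reject.
        return False
    user, _, password = decoded.partition(":")
    return USERS.get(user) == password
```

Anything that fails this check (a search bot, a scraper, a person who stumbled on a link) gets a 401 instead of the page, which keeps content out of indexes far more reliably than robots.txt.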
In addition to the provided answers, you can stop search engines from crawling/indexing a specific page on your website via robots.txt. Below is an example:
User-agent: *
Disallow: /example-page/
The above example is especially handy when you have dynamic pages, otherwise, you may want to add the below HTML meta tag on the specific pages you want to be disallowed from search engines:
<meta name="robots" content="noindex, nofollow" />
