Google Meta Bots - meta-tags

I have a question:
If I put this code in my website:
<META NAME="robot" CONTENT="noindex,nofollow">
Google won't index that specific page, right? Not the whole website?

With NOINDEX, Google will not include any content from this page in its index (The page will be invisible to Google searches).
With NOFOLLOW, Googlebot will not follow any link on this page, so the pages that the current page links to will not be included in Google's index unless Googlebot can reach them in other ways.
Beware that the snippet in the question uses the wrong name for this META tag. ROBOTS needs to be in its plural form, not ROBOT. While upper/lower/mixed casing doesn't matter, I do not believe the bots will try both names.
BTW, <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> is equivalent to
<META NAME="ROBOTS" CONTENT="NONE">
And yes, the rest of the website will be indexed by Google as normal, unless other bot exclusions apply.
The official word on the way Google's bots interpret the META tags can be found on the Official Google Webmaster Central Blog.
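To make the naming point concrete, here is a minimal Python sketch of how a parser might read the tag. It is an illustration only, not how Googlebot is actually implemented; `RobotsMetaParser` and `robots_directives` are hypothetical names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # Only the plural name "robots" is accepted; the value is matched case-insensitively.
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token == "none":
                # "none" is shorthand for "noindex, nofollow".
                self.directives.update({"noindex", "nofollow"})
            elif token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

With this sketch, the tag from the question (NAME="robot") yields no directives at all, while CONTENT="NONE" and CONTENT="NOINDEX, NOFOLLOW" yield the same set.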

Related

How can I show a picture before a link in google sites

I have a website that I've set up through Google Sites. I have a link to an external webpage. What I'd really like is this: if someone clicks the link, it shows a JPG picture for about 5 seconds and then forwards them to the linked website. Is there a way to do that?
Thanks,
Rich
Adding the following tag:
<META http-equiv="refresh" content="5;URL=http://example.com">
to the <head> section of a webpage will redirect the user to example.com, or whatever the URL value is. You can display an image in the <body> section of this page. This seems like the simplest way to accomplish what you want.
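If the intermediate page is generated rather than written by hand, a small Python helper could build it. This is only a sketch: `interstitial_page` and its parameters are hypothetical, and whether Google Sites lets you host such a page is a separate question.

```python
from html import escape


def interstitial_page(image_url, target_url, delay_seconds=5):
    """Build an HTML page that shows an image, then redirects via meta refresh."""
    # Escape both URLs so they are safe inside HTML attribute values.
    image = escape(image_url, quote=True)
    target = escape(target_url, quote=True)
    return (
        "<html>\n"
        "<head>\n"
        f'<meta http-equiv="refresh" content="{int(delay_seconds)};URL={target}">\n'
        "</head>\n"
        "<body>\n"
        f'<img src="{image}" alt="">\n'
        "</body>\n"
        "</html>\n"
    )
```

For example, `interstitial_page("picture.jpg", "http://example.com")` returns a page that displays picture.jpg and redirects to example.com after 5 seconds.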

Disallow In-Page Url Crawls

I want to prevent all bots from crawling a specific type of page. I know this can be done via robots.txt as well as .htaccess. However, these pages are generated from the database at the user's request. I have searched the internet and could not find a good answer for doing so.
My link looks like:
http://www.my_website/some_controller/some_action/download?id=<encrypted_id>
There is a view page for the users wherein all the data that is displayed comes from the database including the kind of links that I have mentioned before. I want to hide those links from the bots and not the entire page. How can I do that?
Could the page not be generated with a
<meta name="robots" content="noindex">
in the head?
You cannot hide content from bots while keeping it available to other traffic. After all, how would you distinguish between a bot and regular traffic? You can't without some sort of verification, such as a CAPTCHA (those pictures of a word that you type into a box).
robots.txt does not stop bots. Most bots will look at it and stay away of their own choice, but only because they are programmed to do so. They are not required to, and can ignore robots.txt completely if they wish.
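For bots that do cooperate, the check looks roughly like the following Python sketch, which uses the standard library's `urllib.robotparser`. The robots.txt content, the `www.example.com` host, and the `ExampleBot` user agent are assumptions for illustration; the path is modelled on the download link in the question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows the generated download links.
ROBOTS_TXT = """\
User-agent: *
Disallow: /some_controller/some_action/download
"""


def may_fetch(url, user_agent="ExampleBot"):
    """Return True if a well-behaved bot would consider itself allowed to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

A polite crawler calls something like `may_fetch` before every request and skips URLs for which it returns False. Nothing forces a crawler to run this check, which is exactly why robots.txt is advisory only.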

Does Google look further than public_html

We have a VPS with WHM/cPanel access and we would like to know the following:
Does Google's crawler see/crawl subdomains even if they aren't pointing to
public_html (and vice versa) and are not mentioned in Google Webmaster Tools?
Note: we have taken precautions through .htaccess and robots.txt and use <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">, but we still found some of them in the Google Webmaster back end, which
we don't understand.
(We have a test.ourdomain.com for developing new "stuff" and so on, hence my question.)
Remove external URLs that point to the blocked pages. Spiders will sometimes crawl a page if there are genuine links to it.
The simplest and most effective way to block private URLs from appearing is to store them in a password-protected directory on your site server. Googlebot and all other web crawlers are unable to access content in password-protected directories.
For more information, see https://support.google.com/webmasters/answer/93708

Bot function after NoFollow rule

I was just wondering what the function of googlebot or any other search engine spider/bot was after you use the no follow rule in a meta tag. Presumably the bot is on your site and gets to a page through link redirection, etc but if the linked page includes the code <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">, where does the bot go after that? Does it go back to the previous page or does it do some other function? Hope this doesn't sound like a stupid question but I was just curious.
Usually, a web crawler does not visit the links found on a given webpage as soon as it encounters them. Instead, these links are added to a waiting list. When the spider finishes loading the current page, it looks in this list and pops another URL from it. The new link is not necessarily from the last fetched page; it can be from a previous page or even from another website (depending on how the list is organized).
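The waiting list described above can be sketched in Python as a simple first-in, first-out queue. This is a toy model, not any real search engine's scheduler; `crawl_order`, `link_graph`, and `nofollow_pages` are hypothetical names, and the "web" is just a dictionary mapping each page to the links it contains.

```python
from collections import deque


def crawl_order(link_graph, seed, nofollow_pages=frozenset()):
    """Return the order in which a toy crawler would visit pages."""
    frontier = deque([seed])  # the waiting list of URLs still to fetch
    seen = {seed}             # URLs already queued, to avoid duplicates
    visited = []
    while frontier:
        url = frontier.popleft()  # pop the next URL, wherever it came from
        visited.append(url)
        if url in nofollow_pages:
            # A nofollow page contributes no links; the crawler simply
            # moves on to whatever is next in the waiting list.
            continue
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

With `link_graph = {"/": ["/a", "/b"], "/a": ["/secret"], "/b": ["/c"]}` and `/a` marked as nofollow, the crawler visits `/`, `/a`, `/b`, `/c`. It never "goes back" after `/a`; it just continues with `/b`, which was already waiting in the list.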

Stop Google from indexing [closed]

Is there a way to stop Google from indexing a site?
robots.txt
User-agent: *
Disallow: /
This tells all compliant search bots not to crawl the site.
For more info see:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360
Remember that preventing Google from crawling doesn't mean you can keep your content private.
My answer is based on a few sources: https://developers.google.com/webmasters/control-crawl-index/docs/getting_started
https://sites.google.com/site/webmasterhelpforum/en/faq--crawling--indexing---ranking
The robots.txt file controls crawling, but not indexing! Those are two completely different actions, performed separately. Some pages may be crawled but not indexed, and some may even be indexed but never crawled. A link to a non-crawled page may exist on other websites, which can lead Google's indexer to follow it and try to index the page.
The question is about indexing, which is gathering data about the page so that it can be made available through search results. It can be blocked by adding a meta tag:
<meta name="robots" content="noindex" />
or by adding an HTTP header to the response:
X-Robots-Tag: noindex
If the question is about crawling, then of course you could create a robots.txt file and put the following lines in it:
User-agent: *
Disallow: /
Crawling is an action performed to gather information about the structure of one specific website. For example, say you've added the site through Google Webmaster Tools. The crawler will take that into account and visit your website, looking for robots.txt. If it doesn't find one, it will assume that it can crawl anything (it's also very important to have a sitemap.xml file to help with this operation, specify priorities, and define change frequencies). If it finds the file, it will follow the rules. After a successful crawl it will at some point run indexing for the crawled pages, but you can't tell when...
Important: this all means that your page can still be shown in Google search results regardless of robots.txt.
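The X-Robots-Tag header mentioned above can also be set by application code instead of the web server. Here is a minimal Python WSGI sketch, assuming a hypothetical application called `app` that serves a single page; a real site would more likely set the header in its framework or server configuration.

```python
def app(environ, start_response):
    """Tiny WSGI application that marks every response as noindex."""
    body = b"<html><body>Staging page</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
        # Ask compliant search engines not to index this response.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the application can be served with `wsgiref.simple_server.make_server` from the standard library and the response headers inspected with any HTTP client. Note that a crawler only sees this header if it is allowed to fetch the page, so it should not be combined with a robots.txt rule that blocks the same URL.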
There are several ways to stop crawlers, including Google, from crawling and indexing your website.
At server level through header
Header set X-Robots-Tag "noindex, nofollow"
At root domain level through robots.txt file
User-agent: *
Disallow: /
At page level through robots meta tag
<meta name="robots" content="noindex, nofollow" />
However, if your website has outdated or no-longer-existing pages/URLs, you can simply wait for some time: Google will automatically deindex those URLs on its next crawl. See https://support.google.com/webmasters/answer/1663419?hl=en
You can disable indexing server-wide by adding the setting below globally in the Apache configuration, or use the same directive inside a vhost to disable it for that vhost only.
Header set X-Robots-Tag "noindex, nofollow"
Once this is done, you can test it by checking the headers Apache returns.
curl -I staging.mywebsite.com
HTTP/1.1 302 Found
Date: Sat, 26 Nov 2016 22:36:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Location: /pages/
X-Robots-Tag: noindex, nofollow
Content-Type: text/html; charset=UTF-8
Bear in mind that Microsoft's crawler for Bing, despite their claim to obey robots.txt, does not always do so.
Our server stats indicate that they have a number of IPs running crawlers that do not obey robots.txt, as well as a number that do.
I use a simple ASPX page to relay results from Google to my browser, using a fake 'Pref' cookie that gets 100 results at a time. I didn't want Google to see this relay page, so I check the IP address and, if it starts with 66.249, I simply do a redirect.
Another trick I use is to have some JavaScript that calls a page to set a flag in the session. Most (NOT ALL) web bots don't execute JavaScript, so if the flag is missing you know the visitor is either a browser with JavaScript turned off or, more than likely, a bot.
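The IP-prefix check described above can be sketched in Python rather than ASPX. The 66.249 prefix is simply the one this answer uses; it is not an authoritative list of Googlebot addresses, and `looks_like_googlebot` is a hypothetical helper name.

```python
import ipaddress

# Assumption taken from the answer: treat every address starting with 66.249 as Google.
ASSUMED_GOOGLE_NETWORK = ipaddress.ip_network("66.249.0.0/16")


def looks_like_googlebot(remote_addr):
    """Heuristic: True if the client address falls inside the assumed Google range."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed addresses are treated as "not Google".
        return False
    return address in ASSUMED_GOOGLE_NETWORK
```

A request handler could call `looks_like_googlebot` with the client address and issue a redirect when it returns True. As with the JavaScript flag, this is a heuristic: it can misclassify visitors, and it does nothing against bots coming from other address ranges.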
Also you can add the meta robots in this way:
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
Another extra layer is to modify .htaccess, but test any such changes carefully.
Use a nofollow meta tag:
<meta name="robots" content="nofollow" />
To specify nofollow at the link level, add the attribute rel with the value nofollow to the link:
<a href="example.html" rel="nofollow">link text</a>
Is there a way to stop Google from indexing a site?
To stop Google from indexing your pages and following their links, simply add the following meta tag to the head of every page:
<meta name="googlebot" content="noindex, nofollow">
