Does Google look further than public_html? - .htaccess

We have a VPS with WHM/cPanel access, and we would like to know the following:
Does Google's crawler see/crawl subdomains even if they aren't pointing to public_html (and vice versa) and are not mentioned in Google Webmaster Tools?
Note: we have taken precautions through .htaccess and robots.txt and use <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">, but we still found some of them in the Google Webmaster Tools back-end, which we don't understand.
(We have a test.ourdomain.com for developing new "stuff" and so on, hence my question.)

Remove external URLs from blocked pages; spiders will sometimes crawl a page if it contains genuine URLs.
The simplest and most effective way to block private URLs from appearing is to store them in a password-protected directory on your site server. Googlebot and all other web crawlers are unable to access content in password-protected directories.
For more information, see https://support.google.com/webmasters/answer/93708
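A minimal .htaccess sketch of that approach, assuming Apache and a password file created with the htpasswd utility (the file path is illustrative); place it in the directory you want to keep crawlers out of:
AuthType Basic
AuthName "Private area"
# Create the password file with: htpasswd -c /home/user/.htpasswd someuser
AuthUserFile /home/user/.htpasswd
Require valid-user
Anyone without credentials, including Googlebot, then receives a 401 instead of the content.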

Related

Disallow In-Page URL Crawls

I want to disallow all bots from crawling a specific type of page. I know this can be done via robots.txt as well as .htaccess. However, these pages are generated from the database in response to users' requests. I have searched the internet and could not find a good answer for doing so.
My link looks like:
http://www.my_website/some_controller/some_action/download?id=<encrypted_id>
There is a view page for the users wherein all the data displayed comes from the database, including the kind of links I have mentioned above. I want to hide those links from the bots, not the entire page. How can I do that?
Could the page not be generated with a
<meta name="robots" content="noindex">
in the head?
You cannot hide content from bots while making it available to other traffic; after all, how would you distinguish a bot from a regular visitor? You can't, without some sort of verification such as a CAPTCHA (those pictures of a word you type into a box).
Robots.txt does not stop bots. Most bots will read it and stop of their own accord, but only because they are programmed to do so. They do not have to, and can ignore robots.txt completely if they wish.
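If the goal is just to keep well-behaved crawlers away from the download URLs themselves while leaving the view pages crawlable, a robots.txt rule along these lines (the path is taken from the example link above) would cover every generated ID, since Disallow matches by prefix:
User-agent: *
Disallow: /some_controller/some_action/download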

Block bots from crawling one of my sites on a multistore, multidomain PrestaShop

Hello, I have a multistore, multidomain PrestaShop installation with main domain example.com, and I want to block all bots from crawling a subdomain site, subdomain.example.com, made for resellers where they can buy at lower prices, because its content duplicates the original site. I am not exactly sure how to do it. Usually, if I wanted to block bots for a site, I would use
User-agent: *
Disallow: /
But how do I use it without hurting the whole store? And is it possible to block the bots from .htaccess too?
Regarding your first question:
If you don't want search engines to access the subdomain, using a robots.txt file ON the subdomain (subdomain.example.com/robots.txt) is the way to go. Don't put it on your regular domain (example.com/robots.txt); see the robots.txt reference guide.
Additionally, I would verify both domains in Google Search Console. There you can monitor and control the indexation of the subdomain and main domain.
Regarding your second question:
I've found a SO thread here which explains what you want to know: Block all bots/crawlers/spiders for a special directory with htaccess.
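Since a PrestaShop multistore usually serves both domains from a single document root, one way to combine the two is a per-host rewrite in .htaccess that serves a blocking robots file only to the subdomain (a sketch; mod_rewrite must be enabled, and robots-subdomain.txt is just an illustrative file name):
RewriteEngine On
# Only requests arriving on the reseller subdomain get the blocking file
RewriteCond %{HTTP_HOST} ^subdomain\.example\.com$ [NC]
RewriteRule ^robots\.txt$ robots-subdomain.txt [L]
robots-subdomain.txt would then contain the two-line Disallow block shown above, while example.com keeps its normal robots.txt.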
We use a canonical URL to tell the search engines where to find the original content.
https://yoast.com/rel-canonical/
A canonical URL allows you to tell search engines that certain similar URLs are actually one and the same. Sometimes you have products or content that is accessible under multiple URLs, or even on multiple websites. Using a canonical URL (an HTML link tag with attribute rel=canonical), these can exist without harming your rankings.
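On the reseller pages, such a tag pointing back to the corresponding page on the main store might look like this (the URL is illustrative):
<link rel="canonical" href="https://example.com/product-name" />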

How to stop search engines from crawling the whole website?

I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (in theory) to try to prevent unauthorized access by removing all search engine bots'/crawlers' access to it entirely. Having Google index our site to make it searchable is pointless from a business perspective and just adds another way for a hacker to find the website in the first place and try to hack it.
I know in the robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt or is it better done by .htaccess or other?
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.
If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
Header set X-Robots-Tag "noindex,nofollow"
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
<meta name="robots" content="noindex,nofollow" />
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.
This is best handled with a robots.txt file, though it only works for bots that respect the file.
To block the whole site add this to robots.txt in the root directory of your site:
User-agent: *
Disallow: /
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.
Below are .htaccess rules to restrict access to everyone except visitors from your company IP:
Order deny,allow
Deny from all
# Enter your company's IP address here
Allow from 255.1.1.1
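If the server runs Apache 2.4 or later, the mod_authz_core equivalent of the block above (same illustrative address) is:
# Apache 2.4+ syntax
Require ip 255.1.1.1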
If security is your concern and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.
That would mean that anyone (Google, a bot, a person who stumbled upon a link) who isn't authenticated wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
In addition to the provided answers, you can stop search engines from crawling/indexing a specific page on your website in robots.txt. Below is an example:
User-agent: *
Disallow: /example-page/
The above example is especially handy when you have dynamic pages; otherwise, you may want to add the HTML meta tag below on the specific pages you want disallowed from search engines:
<meta name="robots" content="noindex, nofollow" />

Question regarding sitemaps

I am storing my sitemaps in my web folder. I want web crawlers (Googlebot etc.) to be able to access the file, but I don't necessarily want all and sundry to have access to it.
For example, this site (stackoverflow.com) has a site index, as specified by its robots.txt file (https://stackoverflow.com/robots.txt).
However, when you type https://stackoverflow.com/sitemap.xml, you are directed to a 404 page.
How can I implement the same thing on my website?
I am running a LAMP website, and I am also using a sitemap index file (so I have multiple sitemaps for the site). I would like to use the same mechanism to make them unavailable via a browser, as described above.
First, decide which networks you want to receive your actual sitemap.
Second, configure your web server to allow requests for your sitemap file from those networks, and to redirect all other requests to your 404 error page.
For nginx, you're looking to stick something like allow 10.10.10.0/24; into a location block for the sitemap file.
For Apache, you're looking to use mod_authz_host's Allow directive inside a <Files> section for the sitemap file.
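A sketch of that Apache approach, using the same 2.2-era mod_authz_host syntax shown elsewhere on this page (the network range is illustrative):
<Files "sitemap.xml">
    Order deny,allow
    Deny from all
    Allow from 10.10.10.0/24
</Files>
Denied clients get a 403 by default; an ErrorDocument 403 directive or a rewrite rule can make the response look like your 404 page instead.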
You can check the user-agent header the client sends, and only pass the sitemap to known search bots. However, this is not really safe since the user-agent header is easily spoofed.
Stack Overflow presumably checks two things when deciding who gets access to the sitemaps:
The USER_AGENT string
The originating IP address
Both will probably be matched against a database of known legitimate bots.
The USER_AGENT string is pretty easy to check in a server-side language; it is also very easy to fake. More info:
For how to check the USER_AGENT string: Way to tell bots from human visitors?
For instructions on IP checking: Google Webmaster Central: How to verify Googlebot
Related: Allowing Google to bypass CAPTCHA verification - sensible or not?
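The IP check Google describes is a reverse-DNS plus forward-DNS round trip. A minimal Python sketch of that idea (the hostname suffixes follow Google's documentation; the user_agent and remote_addr names in the usage comment are hypothetical):
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Return True only if `ip` reverse-resolves to a Google hostname
    that forward-resolves back to the same address."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)      # reverse DNS lookup
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return socket.gethostbyname(host) == ip    # forward-confirm the name
    except OSError:
        return False

# Usage idea: serve the sitemap only when both checks pass, e.g.
# if "Googlebot" in user_agent and is_verified_googlebot(remote_addr): ...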

How to get a domain un-indexed by search engines

I have a domain with a lot of indexed pages; I use it as an online test domain. I understand that I should do my testing on an intranet or something similar, but over time Google has indexed a few sites which are not relevant anymore.
Does anyone know how to get a domain totally unindexed from most search engines?
There are a couple of things you can do.
Set up a restrictive robots.txt file
Password protect the domain root
Request removal directly from SEs
If you have a static IP and you are the only one accessing the site, you can simply deny access to any IPs other than yours.
Place a robots.txt file in the root directory of your website. It can be used to control how much access search engine spiders have to your content. You can declare certain areas of your site off limits to indexing, on a directory-by-directory basis.
Remove the alias domain if you have one.
Remove the URL redirect from the old domain to the new one,
so that search engines can slowly de-index your old domain.
