I have a website, and I use Drupal as the CMS. That is, I write articles in mysite.com/drupal/, but I display them at mysite.com/show_article.php?article_id=x&article_name=y (actually mysite.com/article/x/y, via an .htaccess RewriteRule) by loading the content from the Drupal database.
However, when I search for an article on Google, mysite.com/drupal/node/x/y appears as the result, but mysite.com/article/x/y doesn't.
So I guess I need to add some <google nofollow> tags to Drupal's PHP pages, but which ones? Or is there an easier configuration setting for this?
Thanks!
robots.txt:
User-agent: *
Disallow: /drupal/
Then you'll want to submit a sitemap of the mysite.com/article URLs to Google. It sounds like you won't be able to use the XML sitemap module for this, however.
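Since the article pages live outside Drupal, a hand-written sitemap is one way to do that. A minimal sketch, where the two URLs are placeholders following the question's /article/x/y pattern:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hand-rolled sitemap sketch; list each mysite.com/article/x/y URL here. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://mysite.com/article/1/my-first-article</loc></url>
  <url><loc>http://mysite.com/article/2/another-article</loc></url>
</urlset>
You can then submit that file in Google Webmaster Tools.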
That should be all you need. Of course, the obvious question (if I understand what you're doing) is why you're using custom PHP to serve up Drupal content from the same domain rather than just doing this in Drupal.
I have added a robots.txt file with some lines to restrict some folders. I also added a rule in my .htaccess file that blocks all access to that robots.txt file. Can search engines still read the content of that file?
This file should be freely readable. Search engines are like visitors on your website: if a visitor can't see this file, then a search engine will not be able to see it either.
There's absolutely no reason to try to hide this file.
Web crawlers need to be able to HTTP GET your robots.txt, or they will be unable to parse the file and respect your configuration.
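If you want to keep your other .htaccess restrictions but exempt robots.txt, a minimal sketch (Apache 2.2 syntax, matching the Order/Allow style used elsewhere on this page; adapt it to your existing rules) would be:
<Files "robots.txt">
    # Explicitly allow everyone, crawlers included, to fetch robots.txt,
    # even if neighbouring files are restricted elsewhere in this .htaccess.
    Order Allow,Deny
    Allow from all
</Files>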
The answer is no! But the simplest and safest thing to do is still to test it:
https://support.google.com/webmasters/answer/6062598?hl=en
The robots.txt Tester tool shows you whether your robots.txt file blocks Google web crawlers from specific URLs on your site. For example, you can use this tool to test whether the Googlebot-Image crawler can crawl the URL of an image you wish to block from Google Image Search.
Hello, I have a multistore, multidomain PrestaShop installation with the main domain example.com, and I want to block all bots from crawling a subdomain, subdomain.example.com, made for resellers, where they can buy at lower prices, because its content duplicates the original site. I am not exactly sure how to do it. Usually, if I want to block bots for a site, I would use:
User-agent: *
Disallow: /
But how do I use it without hurting the whole store? And is it possible to block the bots from the .htaccess too?
Regarding your first question:
If you don't want search engines to gain access to the subdomain, using a robots.txt file on the subdomain itself (served at subdomain.example.com/robots.txt) is the way to go. Don't put the rule in your regular domain's file (example.com/robots.txt) - see the robots.txt reference guide.
Additionally, I would verify both domains in Google Search Console. There you can monitor and control the indexation of the subdomain and main domain.
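If both domains share one document root (common in a multistore setup), one sketch is to serve a different robots file per host with mod_rewrite; the file name robots_subdomain.txt is an assumption:
# Serve a dedicated robots file on the reseller host only; the main
# store keeps its normal robots.txt (mod_rewrite, shared .htaccess).
RewriteEngine On
RewriteCond %{HTTP_HOST} ^subdomain\.example\.com$ [NC]
RewriteRule ^robots\.txt$ robots_subdomain.txt [L]
Here robots_subdomain.txt would contain the User-agent: * / Disallow: / pair from your question, so only the reseller host is blocked.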
Regarding your second question:
I've found an SO thread here which explains what you want to know: Block all bots/crawlers/spiders for a special directory with htaccess.
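As a sketch of that idea applied to your case: tag requests that arrive on the reseller host and send a noindex header for them, leaving the main store untouched (Apache, mod_setenvif + mod_headers; the host name is taken from your question):
# Mark requests that arrived on the reseller subdomain...
SetEnvIfNoCase Host ^subdomain\.example\.com$ noindex_host
# ...and ask compliant crawlers not to index those responses.
Header set X-Robots-Tag "noindex, nofollow" env=noindex_host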
We use a canonical URL to tell the search engines where to find the original content.
https://yoast.com/rel-canonical/
A canonical URL allows you to tell search engines that certain similar URLs are actually one and the same. Sometimes you have products or content that is accessible under multiple URLs, or even on multiple websites. Using a canonical URL (an HTML link tag with the attribute rel=canonical), these can exist without harming your rankings.
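For the reseller pages, a minimal sketch (the product path is a placeholder) placed in the <head> of each subdomain page would be:
<!-- On subdomain.example.com, point each page at its main-store original. -->
<link rel="canonical" href="https://example.com/some-product" />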
I'm looking for advice and a method to do the following:
I have a folder on my domain where I am testing a certain landing page.
If it goes well, I might build a new website and domain around this landing page,
and that's the main reason I don't want it to get crawled: so I won't be punished by Google for duplicate content. I also don't want unwanted bots to scrape this landing page, as no good can come of it. Does that make sense to you?
If so, how can I do this? I don't think robots.txt is the best method, as I understand that not all crawlers respect it, and even Google may not fully respect it. I can't put a password on it, since the landing page should be open to all humans (so the solution must not cause any problems for human visitors). Does that leave the .htaccess file? If so, what code should I add there? Are there any downsides I haven't considered?
Thanks!
Use a robots.txt file with the following content:
User-agent: *
Disallow: /some-folder/
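Since you note that not all crawlers respect robots.txt, a supplementary sketch is an .htaccess file placed inside /some-folder/ that adds a noindex header to everything served from it (Apache, mod_headers); human visitors are unaffected, since browsers ignore this header:
# .htaccess inside /some-folder/: ask compliant crawlers not to index
# anything they still fetch from this folder.
Header set X-Robots-Tag "noindex, nofollow"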
We want to open a new domain for certain purposes (call them PR). The thing is, we want the domain to point to the same website we currently have.
We do not want this new domain to appear on search engines (specifically Google) at all.
Options we've ruled out:
Robots.txt can't be used - it will work the same on both domains, which isn't what we want.
The rel=canonical doesn't block - only suggests to index a similar page instead. The original page might end up being indexed.
Is there a way to handle this?
EDIT
Regarding .htaccess suggestions: we're on IIS7.
rel=canonical is not a suggestion. It tells Google exactly which page to use.
Having said that, when serving pages on the domain you do not want indexed, you can use the X-Robots-Tag HTTP header to block those pages from being indexed:
Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP header used to serve the file.
Don't include this document in the Google search results:
X-Robots-Tag: noindex
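On IIS7, one sketch for doing this is an outbound rule in web.config that sets the header only when the request came in on the PR domain (this assumes the URL Rewrite module is installed; pr-domain.example is a placeholder):
<!-- web.config sketch: send X-Robots-Tag: noindex only on the PR host. -->
<configuration>
  <system.webServer>
    <rewrite>
      <outboundRules>
        <rule name="NoindexPrDomain">
          <match serverVariable="RESPONSE_X_Robots_Tag" pattern=".*" />
          <conditions>
            <add input="{HTTP_HOST}" pattern="^pr-domain\.example$" />
          </conditions>
          <action type="Rewrite" value="noindex, nofollow" />
        </rule>
      </outboundRules>
    </rewrite>
  </system.webServer>
</configuration>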
Have you tried setting your preferred domain in Google Webmaster Tools?
The drawback to this approach is that it doesn't work for other search engines.
I would block it via, say, a .htaccess file at the root of the site on the domain in question.
# Tag requests whose User-Agent matches the bot pattern (mod_setenvif);
# SpammerRobot is a placeholder for a real bot's User-Agent string.
BrowserMatchNoCase SpammerRobot bad_bot
# Deny tagged requests, allow everyone else (Apache 2.2 access control).
Order Deny,Allow
Deny from env=bad_bot
Where you'd have to specify the different bots used by the major search engines.
Or you could allow all known web browsers and whitelist them instead.
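A sketch of that whitelist variant (the User-Agent patterns are illustrative, not an exhaustive list):
# Tag requests from known browsers and major crawlers...
BrowserMatchNoCase (Mozilla|Googlebot|bingbot) good_agent
# ...then deny everything that was not tagged (Apache 2.2 syntax).
Order Allow,Deny
Allow from env=good_agent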
I want to move my website to the Drupal CMS while keeping the original paths. They look like:
website.com/search.php?q=blablabla
website.com/index.php?q=blablabla
website.com/category.php?q=auto&page=2
etc.
How can I use these aliases in Drupal? Thank you.
I think you will have great difficulty setting this up, if it's even possible. It would be much better to let Drupal use its standard clean URLs and to set up URL rewrite rules that translate requests for legacy URLs to the new ones.
For example, Drupal's search URL looks like:
website.com/search/node/blahblah
And in .htaccess you could define:
# RewriteRule never sees the query string, so match it with a RewriteCond
# and capture the search term from %{QUERY_STRING}.
RewriteCond %{QUERY_STRING} ^q=(.*)$ [NC]
RewriteRule ^search\.php$ /search/node/%1? [R=301,NC,L]
These match the format of your legacy search URL, extract the query, and redirect it to Drupal's clean form (the trailing ? drops the old query string). That way a request for website.com/search.php?q=blah gets redirected to website.com/search/node/blah before being handled by Drupal, and the user will see the new, Drupal-style URL.
mod_rewrite is well documented.
This is of course going to be harder to do if your legacy URLs make use of unique IDs that do not exist in Drupal. In that case I'd take care to make sure that node IDs, taxonomy IDs, etc. all correspond between your legacy site and your new site. That way you could translate something like /view.php?articleID=121 to /node/121.
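Following the same pattern, a sketch for that article mapping (assuming the numeric IDs line up between the two sites):
# Map legacy article URLs onto Drupal node paths; %1 is the captured ID.
RewriteCond %{QUERY_STRING} ^articleID=([0-9]+)$ [NC]
RewriteRule ^view\.php$ /node/%1? [R=301,NC,L]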
This has the effect of handling any incoming links from search engines, third party sites, or users' bookmarks, but leaves you with an entirely new URL structure. I've used this approach before when migrating to Drupal.