How to prevent bots/Googlebot from indexing a promotional home page?

We have an e-commerce website. As part of a marketing and promotional campaign, we show an app-download page/banner/promotion/big image (and nothing else) on our home page when a user visits the site for the first time; the detection is cookie based.
But I don't want bots/crawlers to see this content (the big image); instead they should see the real content, which appears once the cookie is set. The URL is the same for both versions.
I can clarify more on this. How can I avoid the bots seeing the promotional content?

You need a robots.txt file.
From Wikipedia:
The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention for advising cooperating web crawlers and other web robots about accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.
Bear in mind that robots can simply ignore these directives if they are "evil"; however, Google and other search engines should abide by them provided you set the file up correctly.

I am now using this function in my PHP controller code to detect bots/crawlers and do the redirection as needed:
function bot_detected()
{
    // No User-Agent at all is treated as a bot.
    if (empty($_SERVER['HTTP_USER_AGENT'])) {
        return TRUE;
    }

    $userAgent = $_SERVER['HTTP_USER_AGENT'];

    // Generic crawler keywords.
    if (preg_match('/bot|crawl|slurp|spider/i', $userAgent)) {
        return TRUE;
    }

    // Known bot/client signatures. Note: the original pattern used "/" between the
    // first few names inside a "/"-delimited regex, which breaks it; the alternatives
    // must be separated with "|" ("scrappy" presumably meant the Scrapy crawler).
    if (preg_match('/scrapy|python|httpclient|Googlebot|DoCoMo|YandexBot|bingbot|ia_archiver|AhrefsBot|Ezooms|GSLFbot|WBSearchBot|Twitterbot|TweetmemeBot|Twikle|PaperLiBot|Wotbox|UnwindFetchor|facebookexternalhit/i', $userAgent)) {
        return TRUE;
    }

    return FALSE;
}
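For completeness, here is a rough sketch of how the controller could branch on it; the cookie name, its lifetime and the render() helper are placeholders rather than anything from the original post:
<?php
// Bots and returning visitors get the real home page; first-time human visitors
// get the app-download promotion at the same URL.
if (bot_detected() || isset($_COOKIE['seen_promo'])) {
    render('home');        // real content (render() is a hypothetical view helper)
} else {
    // Remember the visit so the promotion is only shown once.
    setcookie('seen_promo', '1', time() + 365 * 24 * 60 * 60, '/');
    render('app_promo');   // the big promotional image / app download page
}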

Related

Google Not Indexing Site - Says 'Blocked by Robots.txt' - However Robots.txt allows all crawlers -- Same problem with two different hosting services

I have built and published quite a few websites and never had the following issue:
Google is not indexing my website. Whenever I submit the page (in Google Search Console) it says "blocked by robots.txt" although the robots.txt allows every crawler (User-agent: * and Allow: /). The robots.txt is accessible via mydomain.com/robots.txt and the site's sitemap is accessible via mydomain.com/sitemap.
I have tried it with two different hosting providers: Dreamhost.com and Fastcomet.com. The issue persists however, and I cannot see why. The domains are registered with Namecheap.com which I have been using for many other sites since forever.
I use Grav CMS -- a terrific flat-file CMS -- which usually works flawlessly and I don't think that the CMS causes the problem.
Below is a screenshot of Google's error message inside Google Search Console. Obviously, the robots.txt cannot be the culprit, since crawlers are allowed access.
Lastly, not even the domain is coming up in Google's search results. Usually, Google displays a domain without the accompanying description etc., if it is not allowed to crawl that domain.

What are those broken /webstore/detail/** and /track_install/search/** URLs?

I recently published a Chrome extension (Source Code) and now discover some broken incoming links on the extension's website which must be related to that extension:
/track_install/search/ext/free/mebkekakcnabgndiakbbefcgpedlaidp/mixcloud_downloader
/webstore/detail/ext/free/mebkekakcnabgndiakbbefcgpedlaidp/mixcloud_downloader
In the Chrome extension webstore I don't find such links. Do you have any idea where those links come from and what their purpose is? Would users expect anything other than a 404 at those URLs?
The website is referenced in the extension's manifest homepage_url field and on the webstore item in the "Websites" field.
Update: I just noticed another such request where the referrer is this question.
Normally, those URLs are relative to the Webstore and are used for Analytics tracking and this stat page (only available to you). See this mention, for example.
/track_install/... is, quite obviously, used as a beacon to track installs.
/webstore/detail/ext/free/... tracks opening your extension's listing in Web Store.
Here's documentation on homepage_url, which I believe influenced this, including this quote:
If you distribute your extension using the Chrome Web Store, the homepage URL defaults to the extension's own page.
I believe that it's either a bug that those are sent out to your server instead, or a feature I haven't seen documented anywhere to let you track those instead. Note that those are just beacons sent from analytics code; you don't need to serve content on them.
In any case, it's worth reporting, either on the bugtracker or via the exceptionally well-hidden developer support form.

AngularJS routing without a hashtag in the link?

I've recently begun learning AngularJS for web development and am loving it so far. However, I'm not so sure about having hashtags within the link when routing between views. My main concern is how Google will cache the pages on the site and whether the links will work both ways, i.e. whether users can just click www.sampledomain.com/#/orders/450 and be directed straight to the order page. Is this an okay method or is there a way to route views without the hashtag?
When I remove the hashtag and then reload the page, I get a 404 error. Can anyone give me a decent explanation of what is going on? Thanks
When I remove the hashtag and then reload the page, I get a 404 error
That's because your server-side code is probably not handling a request like "www.sampledomain.com/orders/450".
You can have your server-side code handle this request by either returning a redirect to the hashtag URL ("www.sampledomain.com/#/orders/450") or returning the correct HTML directly. The "right" solution will depend on your needs.
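For example, if the backend happened to be PHP (an assumption; the question doesn't say what the server runs), a catch-all script could do either of those two things for a deep link such as /orders/450:
<?php
// Hypothetical catch-all for Angular deep links (assumes a PHP backend and an
// app shell in index.html next to this script).
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

// Option 1: redirect the deep link to its hashtag equivalent.
// header('Location: /#' . $path);
// exit;

// Option 2: return the app shell directly and let the client-side router
// resolve the route once Angular has loaded.
readfile(__DIR__ . '/index.html');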
Users can just click a link with a hashtag and they will be directed straight to the order page.
Google treats links with hashtags as different URLs when the content is different. It's more about SEO than AngularJS, but here is an article about that: The First Link Counts Rule and the Hash Sign - Does it Change PR Sculpting?
You might want to set Angular's $locationProvider to use html5Mode.
From the Angular docs:
$location service has two configuration modes which control the format of the URL in the browser address bar: Hashbang mode (the default) and the HTML5 mode which is based on using the HTML5 History API. Applications use the same API in both modes and the $location service will work with appropriate URL segments and browser APIs to facilitate the browser URL change and history management.
html5Mode will give you "normal" URLs in modern browsers while falling back to hashbangs in older browsers.
An html5Mode URL:
http://foo.com/bar?baz=23#baz
A hashbang URL:
http://foo.com/#!/bar?baz=23#baz

Updating an existing website

I've been asked by a family friend to completely overhaul the website for their business. I've designed my own website, so I know some of the basics of web design and development.
To work on their website from my own home, I know I'll need to FTP into their server, and therefore I'll need their FTP credentials, as well as their CMS credentials. I'm meeting with them in a couple of days and I don't want to look like a moron! Is there anything else I need to ask them for during our first meeting (aside from what they want in their new site, etc.) before I start digging into it?
Thanks!
From an SEO point of view, you should be concerned with 301 redirects, as (I suppose) some or all URL addresses will change (take a different name, be removed, etc.).
So, after you've created the new version of the site - and before you put it online - you should go ahead and list all "old site" URLs and decide, preferably for each one, its new status (unchanged, or redirected and if so, to what URL).
Mind that even if some content will not reappear on the new site, you still have to redirect its URL (say, to the home page) to keep link juice and SERP rankings.
Also, for larger sites (especially dynamic ones), try looking for URL patterns for bulk redirects. For example, if you see that Google indexes 1,000 index.php?search=[some-key-word] pages, you don't need to redirect each one individually, as these are probably just search result pages that can be grouped with a regex and redirected to the main search results page.
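As a rough sketch of that idea (assuming the new site is also PHP-driven; the parameter name and target URL are just placeholders), a check at the top of index.php could fold all of those search URLs into one redirect:
<?php
// Collapse every old index.php?search=... URL into a single 301 to the new
// main search page instead of redirecting each indexed page individually.
if (isset($_GET['search'])) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com/search/');
    exit;
}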
To collect the "old site" URLs you should:
a. Run site:domainname.com in Google (then set the SERP to 100 results and scrape them manually or with XPath).
b. Use Xenu or another site crawler (some like Screaming Frog) to get a list of all URLs.
c. Combine the lists in Excel and remove all duplicates.
If you need help with 301 redirects you can start with this link:
http://www.webconfs.com/how-to-redirect-a-webpage.php/
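If your pages are served by PHP, for instance, a single-page 301 boils down to two header() calls (both URLs below are placeholders):
<?php
// Put this at the top of the old page's script: permanently redirect to the new URL.
header('HTTP/1.1 301 Moved Permanently');
header('Location: http://www.example.com/new-page/');
exit;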
If the website is static, knowing HTML, CSS and JavaScript, along with the FTP credentials, is enough for you to get started. However, if the site is dynamic, interactive and database driven, you may need to ask whether they want to use PHP; in that case you might end up building the site in WordPress.
If you are going to design the website from scratch, then also keep this point in mind: your friend's website is probably hosted somewhere (i.e. with a hosting provider). You should get the hosting control panel details as well, which will help you manage the website (including database, email, FTP, etc.).

How to stop search engines from crawling the whole website?

I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (in theory) to try to prevent unauthorized access by removing all access to it for search engine bots/crawlers. Having Google index our site to make it searchable is pointless from a business perspective and just adds another way for a hacker to find the website in the first place and try to hack it.
I know in the robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt or is it better done by .htaccess or other?
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.
If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
Header set X-Robots-Tag noindex,nofollow
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
<meta name="robots" content="noindex,nofollow" />
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
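And if your pages are generated by a server-side script rather than served as static files, the same header can be sent from application code; in PHP, for example (an assumption; the question doesn't mention what the site runs on):
<?php
// Send the same directive as the .htaccess rule above; must run before any output.
header('X-Robots-Tag: noindex, nofollow');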
In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.
This is best handled with a robots.txt file, though it only works for bots that respect the file.
To block the whole site add this to robots.txt in the root directory of your site:
User-agent: *
Disallow: /
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.
Below are .htaccess rules that restrict everyone except visitors coming from your company IP (note: with "Order allow,deny", a "Deny from all" line would deny everyone, including the allowed IP, so the whitelist form is "Order deny,allow"):
Order deny,allow
Deny from all
# Enter your company's IP address here
Allow from 255.1.1.1
If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.
That would mean that anyone (Google, a bot, or a person who stumbled upon a link) who isn't authenticated wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
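If you want to roll that yourself rather than configure it at the web server level, a minimal PHP sketch of a Basic Authentication gate could look like this (the credentials are placeholders; check against your real user store):
<?php
// Minimal HTTP Basic Authentication gate; include this before any protected page.
$user = isset($_SERVER['PHP_AUTH_USER']) ? $_SERVER['PHP_AUTH_USER'] : '';
$pass = isset($_SERVER['PHP_AUTH_PW'])   ? $_SERVER['PHP_AUTH_PW']   : '';

if ($user !== 'employee' || $pass !== 'change-me') {
    header('WWW-Authenticate: Basic realm="Internal application"');
    header('HTTP/1.0 401 Unauthorized');
    exit('Authentication required.');
}
// Authenticated requests fall through to the protected application.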
In addition to the provided answers, you can stop search engines from crawling/indexing a specific page on your website via robots.txt. Below is an example:
User-agent: *
Disallow: /example-page/
The above example is especially handy when you have dynamic pages; otherwise, you may want to add the below HTML meta tag on the specific pages you want disallowed from search engines:
<meta name="robots" content="noindex, nofollow" />
