Blocking USA traffic without blocking search engine bots? - .htaccess

Dear friends, I need some advice from you.
I have a website that I don't want to receive any traffic from the USA (the site contains only local content). Since most of my visitors come to the website from search engines, I don't want to block those search engine bots.
I know how to:
block IP addresses with .htaccess;
redirect users based on their geolocation.
I think if I block USA IPs then my website won't be indexed in Google or Yahoo. So even though I don't want any USA traffic, I still need my web pages to be indexed in Google and Yahoo.
Depending on $_SERVER['HTTP_USER_AGENT'], I could allow bots to crawl my web pages.
One of my friends told me that if I block USA visitors but not bots, Google will blacklist my website for denying USA visitors access to pages Google has indexed.
Is this true? If so, what should I do about this problem? Any advice is greatly appreciated. Thanks!

Use a JS redirect for US users. This will still allow most of the search engine bots to visit your website.
Use robots.txt to tell Google where and what to read:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
There is a way to add Googlebot's IP addresses (or just its host name: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=80553) as an exception (see the verification sketch below).
Use geotargeting and block the pages with a JS div, or just add a banner that tells your users they can't use the website from their location.
hope this helps,
cheers,
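As a rough PHP sketch of how you might verify that a visitor claiming to be Googlebot really is one (the approach the Google help page linked above describes: a reverse DNS lookup followed by a forward lookup). The function name is purely illustrative:

<?php
// Sketch only: verify a claimed Googlebot visit by host name rather than
// by the easily spoofed User-Agent header. Function name is illustrative.
function isVerifiedGooglebot($ip)
{
    // Reverse lookup: the host name should end in googlebot.com or google.com
    $host = gethostbyaddr($ip);
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // Forward lookup: the host name must resolve back to the same IP
    return gethostbyname($host) === $ip;
}

if (isVerifiedGooglebot($_SERVER['REMOTE_ADDR'])) {
    // Let the crawler through to the normal page
}

Doing two DNS lookups on every request is slow, so in practice you would cache the result per IP address.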

I'm only answering this here because someone was smart enough to spam Google's "Webmaster Central Help Forum" with a link drop. I know this answer is more of a comment, but I blame the question; had it been asked on Webmasters SE, this answer would be more on-topic.
1. Why block US visitors? I understand there can be legal reasons (e.g. gambling), but you could just disable those features for US visitors and let them see the content along with a banner explaining that the service is not available in their location. Search engines won't have any issue with that (they're incapable of gambling or purchasing stuff anyway), and there's no cloaking either.
2. Victor's answer contains a few issues, IMO:
Use a JS redirect for US users. This will still allow most of the search engine bots to visit your website.
This was probably correct at the time of writing, but these days Google (and probably some other search engines as well) is capable of running the JavaScript and will therefore also follow the redirect.
Using Robots.txt to tell Google where and what to read.
I'd suggest using the robots meta tag or the X-Robots-Tag header instead, or responding with a 451 status code to all US visitors.
There is a way to add Googlebot's IP addresses as an exception.
Cloaking.
Use geotargeting and block the pages with a JS div, or just add a banner that tells your users they can't use the website from their location.
Totally agree, do this.
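To make that last point concrete, here is a minimal PHP sketch of the banner approach. country_code_for() is a hypothetical helper wrapping whatever geolocation database or service you use; the key point is that every visitor (and every crawler) gets the same content, so there is no cloaking:

<?php
// Sketch only: same pages for everyone, but US visitors see a notice and
// the restricted features are switched off. country_code_for() is a
// hypothetical wrapper around your geolocation lookup.
$isUsVisitor = (country_code_for($_SERVER['REMOTE_ADDR']) === 'US');

if ($isUsVisitor) {
    echo '<div class="geo-notice">This service is not available in your location.</div>';
}

if (!$isUsVisitor) {
    // Render the purchase / restricted features only for non-US visitors
}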

Related

Website A 'redirect' to subdomain of website B, with content of website A

I was recently asked to do the following:
We have a Drupal website running on IIS.
On that site there is a URL redirect to a website hosted externally, with a domain name completely unrelated to the name of our company.
The request is the following:
They want to change the URL to a subdomain of our website, for example from "www.external-site.com" to "www.sub.internal.com" (while still showing the content of the external website).
They want the current page of that website to be reflected in the URL bar. So it wouldn't say "www.sub.internal.com", but it would say "www.sub.internal.com/solutions/page1.html" (instead of "www.external-site.com/solutions/page1.html")
It's possible that I've forgotten another 'condition' beyond what I've mentioned here.
So, if someone visits the external website through our URL redirect, the URL bar needs to show our subdomain instead of their domain, AND it needs to keep showing the current page as people browse, still under our subdomain.
Now, I checked the external website, and it seems that most of the links there are relative links (in case that is useful information).
Currently, the external website is hosted externally, and will remain so for the next few years. (I believe we bought the company.)
I have been asking around and reading up, and the best option seems to be domain forwarding, but even that doesn't seem to cover everything they asked of me.
I am but a 'simple' .NET programmer, responsible for supporting anything involving the websites, and I can't say I have extensive knowledge of infrastructure. (But I can ask people to do this for me.)
Is there anything that could solve this?
Thanks so much!
IIS's URL Rewrite and Application Request Routing (ARR) combination can help you achieve what you want. Here are a few links that may guide you in configuring ARR. Please note that these links don't show the exact solution to your problem, but you can take clues from them and build your solution accordingly.
http://www.iis.net/learn/extensions/url-rewrite-module/reverse-proxy-with-url-rewrite-v2-and-application-request-routing
http://www.iis.net/learn/extensions/url-rewrite-module/reverse-proxy-rule-template
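Not tested against your setup, but the core of what those articles describe is a URL Rewrite rule along these lines in the web.config of the www.sub.internal.com site (it assumes ARR is installed and proxying is enabled; the host names are the placeholders from your question):

<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <!-- Forward every request on www.sub.internal.com to the external site -->
        <rule name="ReverseProxyToExternal" stopProcessing="true">
          <match url="(.*)" />
          <action type="Rewrite" url="http://www.external-site.com/{R:1}" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>

Because ARR proxies the content instead of redirecting, the visitor's URL bar keeps showing www.sub.internal.com/solutions/page1.html while the response actually comes from www.external-site.com/solutions/page1.html, which is what was asked for (relative links on the external site help here).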
It sounds like you'll want to use a full-page iframe: do not redirect, but serve a page with an "inner page" inside it, where that inner page is the external web site. That way, users do not see the external site in their URL bar.
http://webdesign.about.com/od/iframes/a/aaiframe.htm
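A bare-bones version of that page could look like the sketch below (host names taken from the question). Note, though, that with an iframe the URL bar stays at www.sub.internal.com and will not update to show the inner page the visitor is currently on, so it only covers part of the requirements:

<!DOCTYPE html>
<html>
  <head>
    <title>Solutions</title>
    <style>
      html, body { margin: 0; height: 100%; }
      iframe { width: 100%; height: 100%; border: 0; }
    </style>
  </head>
  <body>
    <!-- The external site is shown inside the frame; the address bar keeps our subdomain -->
    <iframe src="http://www.external-site.com/"></iframe>
  </body>
</html>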
You need to configure the IIS equivalent of an Apache virtual host with a reverse proxy.
See these answers:
https://serverfault.com/a/271030
and
https://stackoverflow.com/a/10003306/2131693

Security concerns using robots.txt

I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The standard instructions are to list these in a robots.txt file and place it in the root of my domain.
But I have an issue with that approach: anyone can go to www.mywebsite.com/robots.txt and see something like this:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
which tells everyone exactly which pages I don't want anyone to visit.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want exposed to the public: the PayPal validation page for my software license payments. The page logic will not let a bogus request through, but it wastes bandwidth (for the PayPal connection, as well as for validation on my server), and it logs a connection-attempt entry in the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and in .php scripts on my server. The name of the page itself is something like /php/ipnius726.php, so it's not something a crawler can simply guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
Secure access to the functionality behind those URLs
Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
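As a rough PHP sketch of option 1 (it assumes you issue some token to legitimate callers, for example after a login or your own verified PayPal handshake; the session key name is made up):

<?php
// Sketch only: the private page refuses to do anything unless the request
// carries a token that your own code issued earlier. The session key name
// 'private_page_token' is just an example.
session_start();

$supplied = isset($_GET['token']) ? $_GET['token'] : '';
$expected = isset($_SESSION['private_page_token']) ? $_SESSION['private_page_token'] : null;

if ($expected === null || !hash_equals($expected, $supplied)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Forbidden');
}

// ... the real functionality runs only past this point ...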
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but just put hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
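A quick sketch of what that honeypot target might look like (the file name honeypot.php and the log path are made-up examples). Anything that requests it found the URL by reading robots.txt, so its address and User-Agent are worth recording:

<?php
// honeypot.php - listed as Disallowed in robots.txt and linked from nowhere,
// so every request to it came from something that read robots.txt.
// File and log names are just examples.
$entry = sprintf(
    "%s\t%s\t%s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'],
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-'
);
file_put_contents(__DIR__ . '/../logs/honeypot.log', $entry, FILE_APPEND);

// Pretend nothing is here
header('HTTP/1.1 404 Not Found');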
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
This is to supplement David Underwood's and unor's answers (not having enough rep points, I am left with just posting another answer). Recent digging shows that Google has a clause that allows it to ignore the previously respected robots file, on top of other security concerns. The link is to a blog post by Zac Gery explaining the new(er) policy, with some simple explanations of how to "force" the Google search engine to be nice. I realize this isn't precisely what you are looking for, but on the QA and security side I have found it very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html

Updating an existing website

I've been asked by a family friend to completely overhaul the website for their business. I've designed my own website, so I know some of the basics of web design and development.
To work on their website from my own home, I know I'll need to FTP into their server, and therefore I'll need their FTP credentials, as well as their CMS credentials. I'm meeting with them in a couple of days and I don't want to look like a moron! Is there anything else I need to ask them for during our first meeting (aside from what they want in their new site, etc.) before I start digging into it?
Thanks!
From an SEO point of view, you should be concerned with 301 redirects, as (I suppose) some or all URL addresses will change (take a different name, be removed, etc.).
So, after you've created the new version of the site, and before you put it online, you should list all "old site" URLs and decide, preferably for each one, its new status (unchanged, or redirected, and if so, to what URL).
Mind that even if some content will not reappear on the new site, you should still redirect its URL (say, to the home page) to keep link juice and SERP rankings.
Also, for larger sites (especially dynamic ones), look for URL patterns you can redirect in bulk. For example, if you see that Google indexes 1,000 index.php?search=[some-key-word] pages, you don't need to redirect each one individually; these are probably just search result pages that can be matched with a regex and redirected to the main search results page.
To inventory the "old site" URLs you should:
a. Search site:domainname.com in Google (then set the SERP to 100 results and scrape them manually or with XPath).
b. Use Xenu or another site crawler (some like Screaming Frog) to get a list of all URLs.
c. Combine the lists in Excel and remove all duplicates.
If you need help with 301 redirects you can start with this link:
http://www.webconfs.com/how-to-redirect-a-webpage.php/
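If the new site ends up on Apache, the single and bulk redirects mentioned above can look roughly like this in .htaccess (all URLs are placeholders; the rewrite rule matches the index.php?search= pattern from the example):

# One-to-one redirect for a renamed or removed page
Redirect 301 /old-about.html http://www.example.com/about/

# Bulk redirect: send every old index.php?search=... URL to the new search page
RewriteEngine On
RewriteCond %{QUERY_STRING} ^search=.+$
RewriteRule ^index\.php$ http://www.example.com/search/? [R=301,L]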
If the website is static, knowing HTML, CSS and JavaScript, along with the FTP credentials, is enough for you to get started. However, if the site is dynamic, interactive and database-driven, you may need to ask whether they want to use PHP; in that case you might end up building the site in WordPress.
If you are going to design the website from scratch, also keep this point in mind: your friend's website is probably hosted somewhere (i.e. with a hosting provider). You should get the hosting control panel details as well, which will help you manage the website (including database, email, FTP, etc.).

How to stop search engines from crawling the whole website?

I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (In Theory) to try and prevent unauthorized access by totally removing access to it by all search engine bots/crawlers. Having Google index our site to make it searchable is pointless from the business perspective and just adds another way for a hacker to find the website in the first place to try and hack it.
I know in the robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt or is it better done by .htaccess or other?
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.
If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
Header set X-Robots-Tag noindex,nofollow
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
<meta name="robots" content="noindex,nofollow" />
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.
This is best handled with a robots.txt file, but only for the bots that respect that file.
To block the whole site add this to robots.txt in the root directory of your site:
User-agent: *
Disallow: /
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.
Below are .htaccess rules that restrict everyone except people coming from your company IP:
Order deny,allow
Deny from all
# Enter your company's IP address here
Allow from 255.1.1.1
If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.
That would mean that anyone (Google, a bot, or a person who stumbled upon a link) who isn't authenticated wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
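For the HTTP Basic Authentication route, a minimal .htaccess sketch looks like this (the password file is created with the htpasswd utility; the path is a placeholder and the file should live outside the web root):

AuthType Basic
AuthName "Company members only"
AuthUserFile /path/to/.htpasswd
Require valid-user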
In addition to the answers provided, you can stop search engines from crawling/indexing a specific page on your website in robots.txt. Below is an example:
User-agent: *
Disallow: /example-page/
The above example is especially handy when you have dynamic pages; otherwise, you may want to add the HTML meta tag below on the specific pages you want disallowed from search engines:
<meta name="robots" content="noindex, nofollow" />

How to Allow Only Google, MSN/Yahoo bot access in .htaccess

I need help to allow only the Google bot and the Yahoo/MSN bots access to my site through .htaccess. Any help is greatly appreciated.
For Google I've got this, but I'm not sure if it's right...
Allow from googlebot.com google.com google-analytics.com
Satisfy Any
I think your reasons for doing this are probably questionable, but the only way to really do this is by the reported User-Agent (an HTTP request header), not by domain, and the reported User-Agent can easily be spoofed by anyone. (This is also usually controlled through robots.txt, but that is typically for the opposite purpose: restricting crawlers, not normal users.) The servers that Google and the others use to crawl sites won't have the same names or IPs as the domains you listed.
For Google, some additional and official details of this are available at http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1061943 . Yahoo and MSN will have similar pages.
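If you decide to do it anyway, a sketch of the User-Agent based version for .htaccess is below (Apache 2.2 style Order/Allow/Deny to match your snippet; the crawler names are the commonly reported ones, and remember any client can fake them):

# Mark requests whose User-Agent looks like one of the big crawlers
SetEnvIfNoCase User-Agent "Googlebot" allowed_bot
SetEnvIfNoCase User-Agent "bingbot|msnbot" allowed_bot
SetEnvIfNoCase User-Agent "Slurp" allowed_bot

# Deny everyone else
Order deny,allow
Deny from all
Allow from env=allowed_bot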
