I need to prevent Alexa and SimilarWeb from accessing my website completely.
I understand this can be done with robots.txt, but as far as I know that's not enough: they also collect data via browser extensions or similar means.
Any ideas or solutions?
Thank you in advance.
You need to block their bots with robots.txt.
Alexa bot: ia_archiver
This is how you block it:
User-agent: ia_archiver
Disallow: /
You need to find the names of all the bots you want to block. It's pretty easy; just search Google.
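Keep in mind robots.txt is only advisory, so only compliant crawlers will honor it. You can at least sanity-check that your rules say what you intend using Python's stdlib `urllib.robotparser` (a quick sketch; example.com is a placeholder):

```python
import urllib.robotparser

# Parse the rules above and confirm ia_archiver is blocked site-wide.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: ia_archiver",
    "Disallow: /",
])

print(rp.can_fetch("ia_archiver", "http://example.com/any/page"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/any/page"))    # True
```

This only tests what a well-behaved bot would do; it doesn't enforce anything on your server.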
I am looking for some clarity on blocking Googlebot from specific pages on my site while still allowing those pages to be indexed by my Google Site Search (GSA). I cannot find a clear answer on this. This is my best guess:
User-agent: *
Disallow: /wp-admin/
Disallow: /example/custom/
User-Agent: gsa-crawler
Allow: /example/custom/
I would like to block Googlebot from indexing any pages under www.example.com/example/custom/ but at the same time index them with GSA. Would this be the correct implementation in my robots.txt file? Or would the gsa-crawler group need to go above User-agent: *? Any insight is much appreciated.
Not sure if this will be helpful:
https://www.google.com/support/enterprise/static/gsa/docs/admin/72/gsa_doc_set/admin_crawl/preparing.html
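To answer the ordering question directly: group order in robots.txt doesn't matter. A compliant crawler obeys only the most specific User-agent group that matches it and ignores the rest, which also means the gsa-crawler group must repeat any Disallow rules you want GSA to obey, since it won't inherit them from the * group. A quick sketch using Python's stdlib `urllib.robotparser` to check the file from the question (example.com is a placeholder):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
    "Disallow: /example/custom/",
    "",
    "User-agent: gsa-crawler",
    "Allow: /example/custom/",
])

# gsa-crawler matches its own group, so /example/custom/ is open to it...
print(rp.can_fetch("gsa-crawler", "http://example.com/example/custom/page"))  # True
# ...while every other bot falls back to the * group and is blocked.
print(rp.can_fetch("Googlebot", "http://example.com/example/custom/page"))    # False
# Caveat: gsa-crawler ignores the * group entirely, so /wp-admin/ stays
# crawlable for it unless its group repeats that Disallow.
print(rp.can_fetch("gsa-crawler", "http://example.com/wp-admin/"))            # True
```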
Security tip: remember that attackers read robots.txt to see which directories you want to "guard".
Cheers!
I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The instructions are to include these in the robots.txt file and place it into the root of my domain.
But I have an issue with this approach: anyone can go to www.mywebsite.com/robots.txt and see its contents:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
which tells everyone exactly which pages I don't want them to visit.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want to be exposed to the public: PayPal validation page for my software license payment. The page logic will not let a dud request through, but it wastes bandwidth (for PayPal connection, as well as for validation on my server) plus it logs a connection-attempt entry into the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and in .php scripts on my server. The name of the page itself is something like /php/ipnius726.php, so it's not something a crawler could simply guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
Secure access to the functionality behind those URLs
Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
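For option 1, here is a minimal sketch of a token check using Python's stdlib hmac module. The SECRET value and the user-id scheme are made-up placeholders; this illustrates the idea, not a complete auth system:

```python
import hashlib
import hmac

SECRET = b"replace-with-a-real-server-side-secret"  # hypothetical placeholder

def make_token(user_id: str) -> str:
    # Sign the user id so a valid token can't be forged without SECRET.
    return hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()

def is_authorized(user_id: str, token: str) -> bool:
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(make_token(user_id), token)
```

Requests to the "hidden" URL would then be rejected unless they carry a valid token, regardless of how the URL leaked.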
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but just put hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
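If you try the honeypot trick, the follow-up step is scanning your access log for clients that requested the trap URL. A small sketch, assuming common log format and the /honeypot.php path from above; adjust both for your server:

```python
import re

HONEYPOT = "/honeypot.php"  # listed in robots.txt, linked from nowhere

# Matches the client IP and request path in common log format lines.
LOG_RE = re.compile(r'(\S+) \S+ \S+ \[[^\]]*\] "[A-Z]+ (\S+)')

def honeypot_visitors(log_lines):
    """Return the set of IPs that requested the honeypot path."""
    ips = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group(2).split("?", 1)[0] == HONEYPOT:
            ips.add(m.group(1))
    return ips
```

Any IP this returns fetched a URL that only robots.txt advertises, so it belongs to a client that mined your robots.txt.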
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
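You can verify this prefix matching locally with Python's stdlib `urllib.robotparser` (example.com is a placeholder as above):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /php/ipn"])

# Every path starting with /php/ipn is disallowed...
blocked = [
    "/php/ipn", "/php/ipn.html", "/php/ipn/",
    "/php/ipn/foo", "/php/ipnfoobar", "/php/ipnius726.php",
]
for path in blocked:
    assert not rp.can_fetch("*", "http://example.com" + path)

# ...while other files under /php/ remain crawlable.
assert rp.can_fetch("*", "http://example.com/php/other.php")
```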
This is to supplement David Underwood's and unor's answers (not having enough rep points, I am left with just answering the question). Recent digging shows that Google has a clause allowing it to ignore the previously respected robots file, on top of other security concerns. The link is a blog post from Zac Gery explaining the new(er) policy, with some simple explanations of how to "force" Google's search engine to be nice. I realize this isn't precisely what you are looking for, but on the QA and security side I have found it to be very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html
I'm looking for advice on how to do this.
I have a folder on my domain where I am testing a certain landing page.
If it goes well, I might build a new website and domain around this landing page,
and that's the main reason I don't want it to get crawled: so I won't be punished by Google for duplicate content. I also don't want unwanted bots to scrape this landing page, as no good can come of it. Does that make sense?
If so, how can I do this? I don't think robots.txt is the best method, as I understand that not all crawlers respect it, and even Google may not fully respect it. I can't put a password on it, since the landing page should be open to all humans (so the solution must not cause any problems for human visitors). Does that leave the .htaccess file? If so, what code should I add there? Are there any downsides I haven't considered?
Thanks!
Use a robots.txt file with the following content:
User-agent: *
Disallow: /some-folder/
Dear friends, I need some advice from you.
I have a website for which I don't want any traffic from the USA (the website contains only local content). Since most of my visitors come from search engines, I don't want to block the search engine bots.
I know how to:
block IP addresses with .htaccess
redirect users based on geolocation
I think that if I block USA IPs, my website won't be indexed by Google or Yahoo. So even though I don't want any USA traffic, I still need my pages to be indexed by Google and Yahoo.
Depending on $_SERVER['HTTP_USER_AGENT'], I can allow bots to crawl my pages.
One of my friends told me that if I block USA visitors except for bots, Google will blacklist my website for denying USA visitors access to pages Google has indexed.
Is this true? If so, what should I do about this problem? Any advice is greatly appreciated. Thanks.
Use a JS redirect for US users. This will allow most of the search engine bots to visit your website.
Use robots.txt to tell Google where and what to read:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
There is a way to add Googlebot's IP addresses (or just its name: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=80553) as an exception.
Use geotargeting and block the pages with a JS div, or just add a banner telling your users that they can't use the website from their location.
hope this helps,
cheers,
I'm only answering this here because someone was smart enough to spam Google's "Webmaster Central Help Forum" with a link drop. I know this answer is more of a comment, but I blame the question; had it been asked on the Webmasters SE, this answer would be more on-topic.
1. Why block US visitors? I understand there can be legal reasons (e.g. gambling), but you could just disable those features for US visitors and let them see the content along with a banner explaining that the full site is not available in their location. Search engines won't have any issue with that (they're incapable of gambling or purchasing anyway), and there's no cloaking either.
2. Victor's answer contains a few issues, IMO:
Using JS redirect for US users. This will allow most of the search engine bots to visit your website.
This was probably correct at the time of writing, but these days Google (and probably some other search engines as well) is capable of running the JavaScript and will therefore also follow the redirect.
Using Robots.txt to tell Google where and what to read.
I'd suggest using the robots meta tag or X-Robots-Tag header instead, or respond with a 451 status code to all US visitors.
There is a way to add Googlebot's IP addresses as an exception.
Cloaking.
Use the geotargeting and block the pages with a JS div or just add a banner that tell your users that they can't use the website from their location.
Totally agree, do this.
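A minimal sketch of the 451 approach as a WSGI app (Python stdlib only). The geolocation check is a made-up placeholder; in practice you would plug in a real GeoIP database lookup, and 203.0.113.5 is a documentation-range address standing in for a "US" IP:

```python
# Hypothetical stand-in for a real GeoIP lookup.
US_IPS = {"203.0.113.5"}

def is_us_ip(ip: str) -> bool:
    return ip in US_IPS

def app(environ, start_response):
    if is_us_ip(environ.get("REMOTE_ADDR", "")):
        # 451 tells users and crawlers alike that the content is
        # legally restricted, without cloaking anything.
        start_response("451 Unavailable For Legal Reasons",
                       [("Content-Type", "text/plain; charset=utf-8")])
        return [b"This content is not available in your region."]
    # Everyone else gets the page. An X-Robots-Tag: noindex header could
    # be added here instead, if the goal were "reachable but unindexed".
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>Hello</body></html>"]
```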
In my opinion this is not wise.
e.g. check this:
http://edition.cnn.com/robots.txt
http://www.bbc.co.uk/robots.txt
http://www.guardian.co.uk/robots.txt
And judging from this:
http://www.joomla.org/robots.txt
Joomla.org has not changed the default administration folder :D
E.g. the PrestaShop site has a blank robots.txt file, which is not perfect but at least better in my opinion:
http://www.prestashop.com/robots.txt
Are these people careless, or do they think it's fine for everyone to know what their web structure looks like?
Why are they not using .htaccess to deny access to robots, etc.?
The problem is that .htaccess can't reliably tell that a visitor is a search engine bot.
Most bots will identify themselves in the user-agent string, but some won't.
robots.txt is accessed by all the bots looking to index the site, but unscrupulous bots are not going to:
identify themselves as bots, or
pay any attention to robots.txt (they may even deliberately disobey it).
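The reliable way to tell a genuine search engine bot from an impostor is not the user-agent string but a reverse-then-forward DNS check, which is what Google documents for verifying Googlebot. A sketch in Python; the network calls obviously need DNS access, and the suffix list covers Google only:

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def looks_like_google_host(host: str) -> bool:
    # Pure check on the PTR name: anyone can claim a user agent,
    # but only Google controls these DNS zones.
    return host.endswith(GOOGLE_SUFFIXES)

def is_verified_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the hostname, then forward-resolve
    the hostname and make sure it points back at the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not looks_like_google_host(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

Server-side blocking (e.g. from .htaccess or application code) keyed on this check works even against bots that lie in their user-agent string and ignore robots.txt.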