I'm looking for some clarity on how to block Googlebot from specific pages on my site while still allowing those pages to be indexed by my Google Search Appliance (GSA). I cannot find a clear answer on this. This is my best guess:
User-agent: *
Disallow: /wp-admin/
Disallow: /example/custom/
User-Agent: gsa-crawler
Allow: /example/custom/
I would like to block Googlebot from indexing any pages under www.example.com/example/custom/ but still have the GSA index them. Would this be the correct implementation in my robots.txt file? Or would the gsa-crawler group need to go above User-agent: * ? Any insight is much appreciated.
Not sure if this helps:
https://www.google.com/support/enterprise/static/gsa/docs/admin/72/gsa_doc_set/admin_crawl/preparing.html
Security tip: remember that attackers read robots.txt to see which directories you are trying to "guard".
Cheers!
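One way to sanity-check the draft from the question is to run it through a parser. This is only a minimal sketch using Python's standard urllib.robotparser, which approximates how well-behaved crawlers read the file; the page name report.html is made up for illustration. Under the robots.txt standard a crawler obeys the most specific User-agent group that matches it, so the order of the groups should not matter.

```python
from urllib.robotparser import RobotFileParser

# The draft robots.txt from the question, unchanged.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /example/custom/
User-Agent: gsa-crawler
Allow: /example/custom/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page under the blocked directory.
page = "http://www.example.com/example/custom/report.html"

# Googlebot has no group of its own, so it falls back to the * group.
googlebot_allowed = parser.can_fetch("Googlebot", page)
# gsa-crawler matches its own group and ignores the * group entirely.
gsa_allowed = parser.can_fetch("gsa-crawler", page)
gsa_admin_allowed = parser.can_fetch("gsa-crawler", "http://www.example.com/wp-admin/")

print(googlebot_allowed, gsa_allowed, gsa_admin_allowed)
```

Note the side effect: because gsa-crawler only reads its own group, it no longer sees the /wp-admin/ rule. If the appliance should stay out of /wp-admin/ too, that Disallow line would have to be repeated in the gsa-crawler group.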
I need to prevent Alexa and SimilarWeb from accessing my website completely.
I understand that it can be done with robots.txt, but as far as I know that is not enough, because they also collect data through extensions or something similar.
Any ideas or solutions?
Thank you in advance.
You need to block their bots with robots.txt.
Alexa bot: ia_archiver
This is how you block it:
User-agent: ia_archiver
Disallow: /
You need to find the names of all the bots you want to block. That is pretty easy: just search for them on Google.
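Once you have the list of names, the file can be generated rather than typed by hand. A rough sketch, assuming Python is available: build_block_rules is a made-up helper, and "ExampleBot" is a placeholder name, not a real crawler. Only ia_archiver comes from the answer above.

```python
from urllib.robotparser import RobotFileParser

def build_block_rules(bot_names):
    """Return robots.txt text that shuts out each named bot completely."""
    groups = [f"User-agent: {name}\nDisallow: /" for name in bot_names]
    # A blank line between groups keeps the file readable.
    return "\n\n".join(groups) + "\n"

# "ia_archiver" is from the answer; "ExampleBot" is a placeholder.
robots_txt = build_block_rules(["ia_archiver", "ExampleBot"])
print(robots_txt)

# Check the result with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
alexa_allowed = parser.can_fetch("ia_archiver", "/")
other_allowed = parser.can_fetch("Googlebot", "/")
```

Keep in mind that this only affects crawlers that choose to obey robots.txt. It does nothing about data collected through browser extensions, which is the concern raised in the question.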
I have a website on a production server, and I have changes to the site that I'd like to test on another web server.
Is there a way to keep Google from indexing the test website? Maybe a setting in web.config?
Use this piece of code in your robots.txt file:
User-agent: *
Disallow: /
This will stop the search engines from crawling your webpage.
It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks http://www.example.com/robots.txt.
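The lookup described above can be sketched in a few lines of Python. The function name robots_url is made up for illustration; the point is only that the robots.txt location depends on the scheme and host, never on the page path.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.example.com/welcome.html"))
# prints http://www.example.com/robots.txt
```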
So if you want to keep out all search engines, upload a robots.txt file to the root of your web server
and include the following piece of code:
User-agent: *
Disallow: /
This will stop all the search engines from crawling.
When you move the site to your production server, change the contents of the robots.txt file to:
User-agent: *
Disallow:
Sitemap: http://www.yourdomainname.com/sitemap.xml
Also include a sitemap.xml file.
Remember: "User-agent: *" means the section applies to all robots, and "Disallow: /" tells a robot that it should not visit any pages on the site.
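To see the difference between the two files, here is a small sketch using Python's standard urllib.robotparser. The helper name allowed is made up, the Sitemap line is left out because it does not affect access, and the parser only approximates what a well-behaved crawler would do.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, path):
    """Parse robots_txt and report whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# Test server: block everything.
STAGING = "User-agent: *\nDisallow: /\n"
# Production server: an empty Disallow blocks nothing.
PRODUCTION = "User-agent: *\nDisallow:\n"

staging_ok = allowed(STAGING, "Googlebot", "/welcome.html")
production_ok = allowed(PRODUCTION, "Googlebot", "/welcome.html")
print(staging_ok, production_ok)
```

The single slash is the whole difference, so it is worth double-checking which version ends up on which server.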
I created a website, www.example.com, and a mobile version of it on the subdomain www.m.example.com. I use an .htaccess file to redirect smartphones to the mobile version. The mobile website's files are in a folder named "mobile". I put a robots.txt file in the main root folder to prevent the mobile URLs from being indexed in search engine results.
My robots.txt file looks like this:
User-agent: *
Disallow: /mobile/
I also put a robots.txt file in the folder named "mobile":
User-agent: *
Disallow: /
My problem is this:
In the desktop version, all results and snippets are correct.
But when I search on mobile, the snippet in the result looks like this:
A description for this result is not available because of this site's robots.txt – learn more
How to solve this?
By using this robots.txt on www.m.example.com
User-agent: *
Disallow: /
you are forbidding bots to crawl any resource on www.m.example.com.
If bots are not allowed to crawl, they can’t access your meta-description.
So everything is working as intended.
If you want your pages to get crawled (and indexed), you have to allow it in your robots.txt (or remove it altogether).
By using the canonical link type, you can denote that two (or more) pages are the same, or that they only have trivial differences (e.g., different HTML structure, table sorted differently etc.), or that one is the superset of the other.
By using the alternate link type, you can denote that it’s an alternate representation of essentially the same content.
(You can see examples in my answer on Webmasters SE.)
I want to hide one individual page from Google. How can I do it?
I have a UserControl for this page.
Thanks in advance.
Try the robots.txt approach first. See the description at http://www.robotstxt.org/robotstxt.html.
Create a robots.txt file in the root of your site, make it readable by everyone, and put this in it:
User-agent: *
Disallow: /<your_page_url>
Will this robots.txt file allow Googlebot to index only my site's index.php file? Caveat: I have an .htaccess redirect, so people who type in
http://www.example.com/index.php
are redirected to simply
http://www.example.com/
So, this is my robots.txt file content...
User-agent: Googlebot
Allow: /index.php
Disallow: /
User-agent: *
Disallow: /
Thanks in advance!
Not really.
Good bots
Only "good" bots follow the robots.txt instructions (not all robots and spiders bother to read or follow robots.txt). That might not even include all of the main search engines' bots, and it definitely means that some web crawlers will completely ignore your requests (look at using .htaccess or password protection if you really want to stop bots and crawlers from seeing parts of your site).
Second checks
Google makes multiple visits to your website, including appearing as a browsing user. This second visit will ignore the robots.txt file. The second visit probably doesn't actually index (if that's your worry) but it does check to make sure you're not trying to fool the indexing bot (for SEO etc).
That being said your syntax is right... if that's all you're asking, then yes it'll work, just not as well as you might hope.
Absent the redirect, Googlebot would not see your site, except for the index.php.
With the redirect, it depends on how the bot handles redirects and how your htaccess does the redirect. If you return a 302, then Googlebot will see http://www.example.com/, check against robots.txt, and not see the main site. Even if you do an internal redirect and tell Googlebot that the responding page is http://www.example.com/, it will see the page but might not index it.
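The redirect problem can be illustrated with Python's standard urllib.robotparser. This is only a sketch: the parser approximates a well-behaved crawler and says nothing about how Googlebot actually treats redirects. It does show that the rules as posted allow /index.php but not the root URL that the redirect points to.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the question, as posted.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /index.php
Disallow: /
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# /index.php matches the Allow rule.
index_allowed = parser.can_fetch("Googlebot", "http://www.example.com/index.php")
# The redirect target "/" only matches "Disallow: /".
root_allowed = parser.can_fetch("Googlebot", "http://www.example.com/")

print(index_allowed, root_allowed)
```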
It's risky. To be sure that Google does index your homepage, use this:
User-agent: *
Allow: /index.php
Disallow: /a
Disallow: /b
...
Disallow: /z
Disallow: /0
...
Disallow: /9
This way your root "/" does not match any of the Disallow rules.
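Here is a rough Python sketch of that idea, assuming the "..." in the list stands for every lowercase letter and every digit (that reading is an assumption, not something the answer spells out). build_rules is a made-up helper, and the check uses the standard urllib.robotparser, which only approximates real crawler behavior.

```python
import string
from urllib.robotparser import RobotFileParser

def build_rules():
    """One Disallow per lowercase letter and digit (assumed meaning of '...')."""
    lines = ["User-agent: *", "Allow: /index.php"]
    lines += [f"Disallow: /{ch}" for ch in string.ascii_lowercase + string.digits]
    return lines

parser = RobotFileParser()
parser.parse(build_rules())

root_allowed = parser.can_fetch("Googlebot", "/")             # no rule matches "/"
index_allowed = parser.can_fetch("Googlebot", "/index.php")   # Allow rule matches
about_allowed = parser.can_fetch("Googlebot", "/about.html")  # blocked by "/a"
upper_allowed = parser.can_fetch("Googlebot", "/About.html")  # no rule for "/A"

print(root_allowed, index_allowed, about_allowed, upper_allowed)
```

The last check shows a gap in this approach: paths are case-sensitive, so anything that starts with an uppercase letter or another character such as "_" slips through unless it gets its own Disallow line.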
Also, if you use AdSense, don't forget to add:
User-agent: Mediapartners-Google
Allow: /