Description of Google search result not available due to robots.txt

I searched Google for my site, "DailyMuses", and got the following result:
It says "A description for this result is not available because of this site's robots.txt".
I went to check the contents of robots.txt in my web app, and it contains the following:
# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
User-Agent: *
Disallow: /
Can anyone advise me on how to get around this and allow a description to be shown in Google search results?

Remove the robots.txt. Done.
UPDATE: To allow search engines to crawl only the index page, use this:
User-agent: *
Allow: /index.php
Allow: /$
Disallow: /
or replace index.php with your index file name, such as index.html or index.jsp
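If instead you want crawlers to access and describe the whole site, either remove robots.txt entirely (as above) or make it explicitly permissive; an empty Disallow value means nothing is blocked:
User-agent: *
Disallow: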

Related

How to block all bots except Google and Bing?

I want to block all bots except Google and Bing.
I am using Cloudflare, but I am confused about how to do this.
I want every bot except these two to face the Cloudflare JS Challenge.
Simple: create a robots.txt file in the root directory
and add the bots you want to allow or disallow:
# robots.txt
# Block everything for all other bots
User-agent: *
Disallow: /

# Bing: allow everything except one page
User-agent: bingbot
Allow: /
Disallow: /some-page-for-bingbot/

# Google: allow everything except one page
User-agent: googlebot
Allow: /
Disallow: /some-page-for-googlebot/
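Note that robots.txt only asks crawlers politely; it does not force anything. Since the question mentions the Cloudflare JS Challenge, a rough sketch of a Cloudflare firewall rule (my own assumption, not part of the original answer) would use Cloudflare's verified-bot field and challenge everything else:
(not cf.client.bot)
with the rule action set to JS Challenge. Keep in mind that cf.client.bot is true for all verified crawlers, not just Googlebot and Bingbot, so treat this only as a starting point.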

Exclude one of subdomains from being crawled using Robots.txt

We have an Umbraco website with several sub-domains, and we want to exclude one of them from being crawled by search engines for now.
I tried to change my robots.txt file, but it seems I am not doing it right.
URL: http://mywebsite.co.dl/
Subdomain: http://sub1.mywebsite.co.dl/
My robots.txt content is as follows:
User-agent: *
Disallow: sub1.*
What have I missed?
The following code will block http://sub1.mywebsite.co.dl/ from being indexed (note that this only works if the sub-domain's content is actually served from a /sub1/ folder of the main site):
User-agent: *
Disallow: /sub1/
You can also add another robots.txt file in the sub1 folder with the following code:
User-agent: *
Disallow: /
and that should help as well.
If you want to block anything on http://sub1.mywebsite.co.dl/, your robots.txt MUST be accessible on http://sub1.mywebsite.co.dl/robots.txt.
This robots.txt will block all URLs for all compliant bots:
User-agent: *
Disallow: /
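If both host names are served from the same Umbraco installation, there is only one physical document root, so you cannot simply drop a second robots.txt into a sub1 folder. One possible workaround, sketched here under the assumption that the site runs on IIS with the URL Rewrite module installed (typical for Umbraco) and that robots-sub1.txt is a hypothetical extra file containing the Disallow-all rules above, is to rewrite robots.txt requests per host in web.config:
<!-- inside <system.webServer><rewrite><rules> in web.config -->
<rule name="robots.txt for sub1" stopProcessing="true">
  <match url="^robots\.txt$" />
  <conditions>
    <add input="{HTTP_HOST}" pattern="^sub1\." />
  </conditions>
  <action type="Rewrite" url="robots-sub1.txt" />
</rule>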

Htaccess to use the hosting for live testing

I would like to use my hosting for live testing, but I want to protect access and prevent search engines from indexing it.
For example (server directory structure) within public_html:
_private
_bin
_cnf
_log
... (more default hosting directories)
testpublic
css
images
index.html
I want index.html to be visible to everyone, and all other directories (except "testpublic") to be hidden, access-protected, and not indexed by search engines.
I would like the "testpublic" directory to be public, but it must not be indexed by search engines; I'm not sure if this is possible.
As I understand it, I need two .htaccess files:
a general one in "public_html" and a specific one for "testpublic".
The general .htaccess (public_html) should, I think, be something like:
AuthUserFile /home/folder../.htpasswd
AuthName "test!"
AuthType Basic
require user admin123
<FilesMatch "index.html">
Satisfy Any
</FilesMatch>
Can anyone help me create the files with the appropriate properties? Thank you!
You can use a robots.txt file in your root folder. All standards-abiding robots will obey this file and not index your files and folders.
Example robots.txt that tells all (*) crawlers to move on and index nothing:
User-agent: *
Disallow: /
You can use .htaccess files to fine-tune what your server (assuming Apache) serves and which directory indexes are visible. In that case, you would add
IndexIgnore *
to your .htaccess file to hide all files from directory listings.
Updated (Credit to https://stackoverflow.com/users/1714715/samuel-cook):
If you want to stop a specific bot/crawler and know its User-Agent string, you can do so in your .htaccess:
<IfModule mod_rewrite.c>
RewriteEngine on
# Return 403 Forbidden for any request whose User-Agent contains "Googlebot"
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule ^.* - [F,L]
</IfModule>
Hope this helps.
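Putting the answer together with the original question, a rough sketch of the two .htaccess files might look like this (assuming Apache 2.2-style access control; the .htpasswd path and the user name are placeholders taken from the question):
# public_html/.htaccess - password-protect everything by default
AuthUserFile /home/youraccount/.htpasswd
AuthName "test!"
AuthType Basic
Require user admin123
# let index.html through without a password
<FilesMatch "^index\.html$">
    Satisfy Any
    Allow from all
</FilesMatch>

# testpublic/.htaccess - public, but ask search engines not to index it
Satisfy Any
Allow from all
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex, nofollow"
</IfModule>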

Remove page from Google Listings using robots.txt file

Is this the correct way to do it? Below is my robots.txt file. Would this prevent Google from indexing my admin directory as well as oldpage.php?
User-agent: *
Allow: /
Disallow: /admin/
Disallow: http://www.mysite.com/oldpage.php
Yes, you are absolutely correct, except for the single-file restriction: a Disallow rule takes a path relative to the site root, not a full URL.
User-agent: * : applies to all crawlers
Allow: / : allows access to the full site
Disallow: /admin/ : restricts the /admin/ directory
Disallow: /oldpage.php : restricts oldpage.php
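Putting that together, the corrected robots.txt would be:
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /oldpage.php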

What is called first - robots.txt or mod_rewrite in htaccess

I need some help. I'm not sure about the order in which robots.txt and mod_rewrite are applied to a request.
Some URLs are covered by a rewrite rule:
/index.php?id=123 to /home
Other URLs don't have a rewrite:
/index.php?id=444
I added this entry to my robots.txt:
User-agent: *
Disallow: /index.php?id
Will the page at /home be indexed by search engines?
The robots.txt file is interpreted by the client (the spider), which knows nothing about the rewrites configured on your server; the patterns are matched against the URLs the crawler sees before it makes a request. So spiders will not fetch URLs that match the pattern in your robots.txt, but they will crawl and index the same content when they find it via /home.
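If the goal is for only /home to end up in the index, a complementary approach (a sketch, assuming Apache mod_rewrite and that id=123 is the id that maps to /home, as in the question) is to 301-redirect the parameterised URL to its clean equivalent, so crawlers and visitors are steered to /home:
RewriteEngine On
# Check the original request line (not the internally rewritten URL) to avoid a redirect loop
RewriteCond %{THE_REQUEST} \?id=123[&\s]
RewriteRule ^index\.php$ /home? [R=301,L]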
