Strange robots.txt problem in Apache Nutch 1.17

I am having a strange problem with robots.txt when using Nutch 1.17. I am using the Selenium protocol and have tried both Firefox and Chrome. The logs show that the robots.txt file cannot be parsed:
2020-09-14 08:15:45,751 WARN robots.SimpleRobotRulesParser - Problem processing robots.txt for https://website.com/some.html
2020-09-14 08:15:45,751 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 156): ^#^#^#^#^#^#^C^K-N-�MLO�+�R�MM�L,H,*�K-*�u��O�I�r�,N���/��r.J,��MI�I��R02��
2020-09-14 08:15:45,752 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 156): E�^KQ��_B�������
2020-09-14 08:15:45,753 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 156): �j�Ss�J�3��MU���F7�^Q����<��T���^�FI��I/J,H-������5^B�[�p^E��^A^W^#^Z�`X�
I checked the robots.txt itself and everything looks fine:
User-agent: *
Disallow: /index.php/
Disallow: /*?
Disallow: /report/
Disallow: /var/
Disallow: /path/
I don't know what's going on underneath Nutch, but it seems like Nutch is trying to parse the HTML page instead of the robots.txt of that particular domain. Does anyone know about this issue?
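For what it's worth, those warnings come from crawler-commons' SimpleRobotRulesParser, which Nutch delegates robots.txt parsing to. Below is a minimal sketch (assuming crawler-commons on the classpath; the agent name "mybot" and the inlined rules are just placeholders) that feeds the same rules to the parser directly, to confirm they parse cleanly outside Nutch:

import java.nio.charset.StandardCharsets;
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
    public static void main(String[] args) {
        // The same rules as in the robots.txt quoted above
        String robotsTxt = String.join("\n",
                "User-agent: *",
                "Disallow: /index.php/",
                "Disallow: /*?",
                "Disallow: /report/");
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "https://website.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "mybot");
        System.out.println(rules.isAllowed("https://website.com/report/x.html")); // expect false
        System.out.println(rules.isAllowed("https://website.com/some.html"));     // expect true
    }
}

If that parses without warnings, the problem is more likely in what the Selenium-based protocol plugin actually returned for the robots.txt fetch than in the file itself.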

Related

How to block all bots except Google and Bing?

I am using Cloudflare, but I am confused about how to do this. I want all bots except these two to face the Cloudflare JS Challenge.
Simple: create a robots.txt file in the root directory and add rules for whichever bots you want to allow or disallow:
# robots.txt
User-agent: *
Disallow: /
User-agent: bingbot
Allow: /
Disallow: /some-page-for-bingbot/
User-agent: googlebot
Allow: /
Disallow: /some-page-for-googlebot/

Exclude one of the subdomains from being crawled using robots.txt

We have an Umbraco website which has several subdomains, and we want to exclude one of them from being crawled by search engines for now.
I tried to change my robots.txt file, but it seems I am not doing it right.
URL: http://mywebsite.co.dl/
Subdomain: http://sub1.mywebsite.co.dl/
My robots.txt content is as follows:
User-agent: *
Disallow: sub1.*
What have I missed?
The following code will block http://sub1.mywebsite.co.dl from being indexed:
User-agent: *
Disallow: /sub1/
You can also add another robots.txt file in the sub1 folder with the following code:
User-agent: *
Disallow: /
and that should help as well.
If you want to block anything on http://sub1.mywebsite.co.dl/, your robots.txt MUST be accessible on http://sub1.mywebsite.co.dl/robots.txt.
This robots.txt will block all URLs on that host for all compliant bots:
User-agent: *
Disallow: /
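If both hosts are served from the same document root, one way to expose a different robots.txt on the subdomain is a host-based rewrite. A rough sketch, assuming the site is served by Apache with mod_rewrite (Umbraco installs are often on IIS, where a web.config rewrite rule would play the same role) and a hypothetical robots-sub1.txt containing the blocking rules above:

# Serve robots-sub1.txt when the request comes in on the sub1 host
RewriteEngine On
RewriteCond %{HTTP_HOST} ^sub1\.mywebsite\.co\.dl$ [NC]
RewriteRule ^robots\.txt$ robots-sub1.txt [L]

Requests for http://mywebsite.co.dl/robots.txt keep getting the normal file; only the sub1 host sees the blocking version.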

Description of Google search result not available due to robots.txt

I searched Google for my site "DailyMuses" and got the following result.
It says: "A description for this result is not available because of this site's robots.txt".
I went to check the contents of the robots.txt in my web app, and it contains the following:
# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
User-Agent: *
Disallow: /
Can anyone advise me on how I can get around this and allow a description to be shown in the Google search results?
Remove the robots.txt. Done.
UPDATE: To allow search engines to crawl only the index page, use this:
User-agent: *
Allow: /index.php
Allow: /$
Disallow: /
Or replace index.php with your index file name, such as index.html or index.jsp. (Allow and the $ end-of-URL anchor are extensions honored by the major search engines such as Google and Bing, not part of the original robots.txt standard.)

Blocking Links From Being Indexed in OpenCart

I seem to be receiving a lot of 404 errors from Google Webmaster Tools of late. Is there a way to prevent these sorts of links from being indexed? And where do they come from?
I would appreciate it if you could shed some light on this matter. Thanks in advance.
I am on OpenCart 1.5.4.1.
These are the 404 errors:
http://mydomain.com/product/search&filter_tag=Purse&sort=rating&order=ASC&limit=25
http://mydomain.com/product/search&sort=p.price&order=DESC&filter_tag=&limit=25
http://mydomain.com/cart?qty=1&id_product=323&token=bfcf19e76ed4a88abd8508e8dc097a70&add
I've also included the following entries in my robots.txt, to no avail:
User-agent: Googlebot
User-agent: *
Disallow: /cgi-bin
Disallow: /*?limit
Disallow: /*&limit
Disallow: /*?sort
Disallow: /*&sort
Disallow: /*?route=checkout
Disallow: /*?route=account
Disallow: /*?route=product/search
Disallow: /*?route=affiliate
Disallow: /*&keyword
Disallow: /*?page=1
It looks like you are using (or were at one point) an extension that tries to rewrite ugly URLs to better, more friendly ones. If you can, you should use .htaccess to rewrite the bad URLs to the correct ones rather than just blocking the pages; this will help your site.
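If the broken links simply have "&" where the first "?" should be (as in the search URLs above), one option is to 301-redirect them to OpenCart's real search route so Google stops reporting the 404s. A rough .htaccess sketch, assuming Apache with mod_rewrite and OpenCart's standard index.php?route=... routing; the rule below is only an illustration and should be tested first:

# Redirect the malformed search URLs to the canonical route
RewriteEngine On
RewriteRule ^product/search&(.*)$ index.php?route=product/search&$1 [R=301,L]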

What is called first: robots.txt or mod_rewrite in .htaccess?

I need some help. I'm not sure in what order mod_rewrite and robots.txt are applied to a request.
Some URLs are covered by a rewrite rule:
/index.php?id=123 to /home
Other URLs don't have a rewrite:
/index.php?id=444
I added this entry to my robots.txt:
User-agent: *
Disallow: /index.php?id
Will the page at /home be indexed by search engines?
The robots.txt file is interpreted by the client (the spider), which doesn't know what rewrites you have on your server. Spiders therefore will not fetch URLs from your site that match a pattern in robots.txt, but they will fetch /home if they find the same content linked under that URL, because it is the public URL, not the internally rewritten one, that is matched against robots.txt.
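As a concrete sketch of that interplay (the rewrite rule below is hypothetical, just mirroring the example above):

# .htaccess (server side): the pretty URL is rewritten internally
RewriteEngine On
RewriteRule ^home$ /index.php?id=123 [L]

# robots.txt (client side): a crawler only checks the URL it is about to request
User-agent: *
Disallow: /index.php?id

A crawler that finds a link to /index.php?id=444 will skip it because it matches the Disallow pattern, but a link to /home matches nothing in robots.txt, so /home can still be crawled and indexed even though it is internally the same script.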
