How to configure Apache Nutch to ignore certain url patterns - nutch

I am crawling a website using Apache Nutch. While crawling, I want Nutch to ignore multiple URL patterns such as http://www.youtube.com/..., http://www.twitter.com/..., etc.
I know how to configure the regex-urlfilter.txt file to crawl specific URLs.
But I don't know how to configure Nutch to ignore certain URL patterns.

I followed the following URL and found many useful examples:
https://scm.thm.de/pharus/nutch-config/blobs/66fba7d3dc015974b5c194e7ba49da60fe3c3199/Nutch-Config/conf/regex-urlfilter.txt
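For reference, ignore rules in regex-urlfilter.txt are simply deny patterns prefixed with "-"; Nutch applies the rules top to bottom and the first match decides a URL's fate. A minimal sketch (the host patterns here are only illustrative) that skips YouTube and Twitter while accepting everything else:

# conf/regex-urlfilter.txt -- the first matching rule wins
# skip anything hosted on youtube.com or twitter.com
-^https?://(www\.)?youtube\.com/
-^https?://(www\.)?twitter\.com/
# accept everything else
+.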

Related

Can search engines read robots.txt if its read access is restricted?

I have added a robots.txt file with some lines restricting certain folders. I also added a rule in an .htaccess file denying everyone access to that robots.txt file. Can search engines still read the contents of that file?
This file should be freely readable. Search engines are like visitors on your website: if a visitor can't see this file, then the search engine will not be able to see it either.
There's absolutely no reason to try to hide this file.
Web crawlers need to be able to HTTP GET your robots.txt, or they will be unable to parse the file and respect your configuration.
The answer is no. But the simplest and safest approach is still to test it yourself:
https://support.google.com/webmasters/answer/6062598?hl=en
The robots.txt Tester tool shows you whether your robots.txt file blocks Google web crawlers from specific URLs on your site. For example, you can use this tool to test whether the Googlebot-Image crawler can crawl the URL of an image you wish to block from Google Image Search.
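As a concrete illustration (a sketch only, assuming Apache 2.4 with .htaccess overrides enabled), an exception like this keeps robots.txt readable by crawlers even if other files in the same directory are protected:

# .htaccess sketch: whatever other restrictions apply to this directory,
# explicitly allow everyone to read robots.txt (Apache 2.4 syntax)
<Files "robots.txt">
    Require all granted
</Files>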

Why doesn't Nutch crawl all links on non-English sites?

I crawl a site with Nutch 1.4, and I have noticed that Nutch doesn't crawl all of the links on this site. I have no filter and no limit rule for crawling. For example, Nutch never crawls this link:
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/
If I give this link to Nutch to crawl, it never crawls it. The site is in Farsi, not English.
How can I crawl this link?
Nutch runs URL normalization and other URL-processing plugins on each URL before adding it to the crawldb. Your URL might have been filtered out there. You can remove those plugins from the list of plugins used (the plugin.includes property in conf/nutch-site.xml) and try again.
Another reason the non-English URL might fail to fetch is a mismatch between the URL encoding used by the web server at www.irna.ir and the encoding used by the Nutch client.
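For example, a sketch of the override in conf/nutch-site.xml (the plugin list below is illustrative; copy the real plugin.includes value from your version's nutch-default.xml and remove the urlnormalizer-* and urlfilter-* entries you want to bypass while testing):

<!-- conf/nutch-site.xml: trimmed plugin list without URL normalizers/filters -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(html|tika)|index-(basic|anchor)|scoring-opic</value>
</property>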

delete url from crawldb in nutch 1.3?

I crawl sites with Nutch 1.3. Now I want to delete a URL from the crawldb. How can I do this? And how do I read from the crawldb? I want to see the URLs that exist in the crawldb.
To read from the crawldb you can use the CrawlDbReader class (org.apache.nutch.crawl package). To delete/remove a URL from the crawldb, you can try the CrawlDbMerger class (org.apache.nutch.crawl package) with the "-filter" option. But I suggest writing a MapReduce job that deletes URLs according to your needs.
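For day-to-day use, both classes are exposed through the bin/nutch command line. A sketch, assuming your crawldb lives at crawl/crawldb (the paths and output directories are placeholders):

# print summary statistics for the crawldb
bin/nutch readdb crawl/crawldb -stats

# dump every URL and its status so you can inspect what the crawldb contains
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# rebuild the crawldb through the configured URL filters; add a "-" rule for the
# unwanted URL to regex-urlfilter.txt first and it is dropped from the merged copy
bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter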

nutch crawl path

I would like to know how to make Nutch restrict its crawl not just to the domain I specified, but also to a specific directory path within that domain. I know that this can be configured in regex-urlfilter.txt.
This should crawl only the domain/path you want:
+.*www\.domain\.com/yourpath/.*
#skip everything else
-.*

Nutch crawling with seed URLs in a range

Some sites have URL patterns like www.___.com/id=1 through www.___.com/id=1000. How can I crawl such a site using Nutch? Is there any way to provide seeds for fetching over a range?
I think the easiest way would be to have a script generate your initial list of URLs.
No, you have to inject them manually or with a script.
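A minimal sketch of such a script, assuming a Unix shell and using www.example.com as a stand-in for the real host:

# generate seeds/seed.txt with id=1..1000, then inject it into the crawldb
mkdir -p seeds
for i in $(seq 1 1000); do
  echo "http://www.example.com/id=$i"
done > seeds/seed.txt

bin/nutch inject crawl/crawldb seeds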
