I crawl a site with Nutch 1.4, and I have noticed that Nutch doesn't crawl all the links on this site. I have no filter and no limit rule for crawling. For example, Nutch never crawls this link:
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/
Even if I give this link to Nutch directly, it never crawls it. The site is in Farsi, not English.
How can I crawl this link?
Nutch runs URL normalization and other URL-processing plugins on each URL before adding it to the crawldb, and your URL might have been filtered out there. You can remove those plugins from the list of plugins used (the plugin.includes property in conf/nutch-site.xml) and try again.
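For example, the URL filter and normalizer plugins can be dropped by overriding plugin.includes in conf/nutch-site.xml. The value below is a sketch: it is the stock Nutch 1.4 default list with urlfilter-regex and the urlnormalizer-* plugins removed, so check conf/nutch-default.xml in your release for the exact default before copying it:

```xml
<!-- conf/nutch-site.xml -->
<!-- default plugin list minus urlfilter-regex and urlnormalizer-(pass|regex|basic) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(html|tika)|index-(basic|anchor)|scoring-opic</value>
</property>
```

Before removing plugins wholesale, it is worth checking whether the filter chain is really what rejects the URL; Nutch 1.x ships a checker tool that reads URLs from stdin (e.g. `bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined`), though the exact class name and flags may vary by release.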
Another reason the fetch of the non-English URL might fail is a mismatch between the URL encoding used by the web server at www.irna.ir and the encoding used by the Nutch client.
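One way to rule that out is to percent-encode the non-ASCII path segments yourself before seeding the URL, so Nutch only ever sees plain ASCII. A minimal sketch (it shells out to python3 for the encoding, and the seed file path is just an example):

```shell
# percent-encode every byte of the URL except ASCII ':' and '/'
url='http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/'
encoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=":/"))' "$url")

# prints an all-ASCII URL beginning http://www.irna.ir/News/30786427/%D8%B3
printf '%s\n' "$encoded"
```

The encoded form can then be placed in the seed list instead of the raw Farsi URL.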
Related
I am using Nutch to collect all the data from a single domain. How can I ensure that Nutch has crawled every page under a given domain?
This is not technically possible, since there is no limit on the number of different pages you can have under the same domain; this is especially true for dynamically generated websites. What you could do is look for a sitemap.xml and ensure that all of its URLs are crawled/indexed by Nutch. Since the sitemap is what declares those URLs, you can use it as a guide for what needs to be crawled.
Nutch has a sitemap processor that will inject all the URLs from the sitemap into the current crawldb (i.e., it will "schedule" those URLs to be crawled).
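The sitemap processor is driven from the command line. The sketch below assumes a newer Nutch 1.x release that ships the `bin/nutch sitemap` command, and example paths (crawl/crawldb, sitemap_urls/):

```shell
# a file listing the sitemap URL(s) to process
mkdir -p sitemap_urls
echo "http://www.example.com/sitemap.xml" > sitemap_urls/urls.txt

# parse the sitemaps and inject their URLs into the existing crawldb
bin/nutch sitemap crawl/crawldb -sitemapUrls sitemap_urls
```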
As a hint, even Google enforces a maximum number of URLs to be indexed from the same domain when doing a deep crawl. This is usually referred to as a crawl budget.
I am crawling a website using Apache Nutch. While crawling, I want Nutch to ignore multiple URL patterns like http://www.youtube.com/..., http://www.twitter.com/..., and so on.
I know how to configure the regex-urlfilter.txt file to crawl specific URLs, but I don't know how to configure Nutch to ignore certain URL patterns.
I followed the URL below and found many useful examples:
https://scm.thm.de/pharus/nutch-config/blobs/66fba7d3dc015974b5c194e7ba49da60fe3c3199/Nutch-Config/conf/regex-urlfilter.txt
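For instance, exclusion rules like the following can be added near the top of conf/regex-urlfilter.txt to skip the YouTube and Twitter patterns from the question. Rules are applied top-down and the first matching rule wins; a leading '-' rejects the URL, a leading '+' accepts it:

```
# reject anything on youtube.com or twitter.com
-^https?://(www\.)?youtube\.com/
-^https?://(www\.)?twitter\.com/

# accept everything else
+.
```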
I use Nutch and I would like to index an intranet, but how can I make sure everything on the intranet will be indexed?
Thanks.
If you know all the URLs of the intranet, then write a sitemap or a seed list containing all of them and point the crawler at it.
If you don't, then you can never be sure that you have crawled all the URLs, because you cannot verify it after the crawl.
In that case, your best option is to crawl at the maximum depth.
Regards
I crawl sites with Nutch 1.3. Now I want to delete a URL from the crawldb. How can I do this? And how do I read from the crawldb? I want to see the URLs that exist in it.
To read from the crawldb you can use the CrawlDbReader class (org.apache.nutch.crawl package). To delete/remove a URL from the crawldb, try the CrawlDbMerger class (same package) with the "-filter" option. But I suggest writing a MapReduce job to delete URLs according to your needs.
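Both classes are exposed through the bin/nutch wrapper; the crawldb path below (crawl/crawldb) is just an example:

```shell
# CrawlDbReader: dump the whole crawldb as text to inspect its URLs
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# or look up a single URL / print summary statistics
bin/nutch readdb crawl/crawldb -url http://www.example.com/page.html
bin/nutch readdb crawl/crawldb -stats

# CrawlDbMerger: add a '-' rule for the unwanted URL to
# conf/regex-urlfilter.txt, then merge with -filter so the
# filtered URLs are dropped from the output crawldb
bin/nutch mergedb crawl/crawldb_new crawl/crawldb -filter
```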
Some sites have URL patterns ranging from www.___.com/id=1 to www.___.com/id=1000. How can I crawl such a site using Nutch? Is there any way to provide seeds for fetching a range?
I think the easiest way would be to use a script to generate your initial list of URLs.
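A minimal sketch of such a script (www.example.com stands in for the real host, and urls/ is the seed directory you would later pass to bin/nutch inject):

```shell
# write one seed URL per line for id=1 .. id=1000
mkdir -p urls
for i in $(seq 1 1000); do
  echo "http://www.example.com/id=$i"
done > urls/seed.txt

# the list can then be injected with: bin/nutch inject crawl/crawldb urls
```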
No, you have to inject them manually or by using a script.