I am using apache-nutch-1.6 and I can successfully crawl web sites.
My problem is that not all entries in the seed.txt file are used; it depends on which sites are inside. Is there a limit somewhere on how much is crawled? There is no error message. If I delete one site, another site is crawled deeply, whereas if that site is present, it is the one that gets crawled and from the other sites only the top pages are fetched, I believe...
Configure this correctly:
bin/nutch crawl $URLS -dir $CRAWL_LOC -depth 10 -topN 1000
Depth: Nutch will crawl up to this many levels deep, where each level is one round of following links from the previous round.
topN: at each level, Nutch will fetch at most this many URLs.
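For reference, the one-shot crawl command above simply wraps the individual Nutch steps. A minimal sketch of one round, assuming the crawl directory is crawl/ and the seed directory is urls/ (both just example paths); the round is repeated up to -depth times, and -topN is applied at the generate step:
bin/nutch inject crawl/crawldb urls                          # seed the crawldb (run once)
bin/nutch generate crawl/crawldb crawl/segments -topN 1000   # select at most topN URLs that are due
s=`ls -d crawl/segments/* | tail -1`                         # pick the newest segment
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawl/crawldb $s                          # merge the results back into the crawldb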
Related
I am using Nutch to collect all the data from a single domain. How can I ensure that Nutch has crawled every page under a given domain?
This is not technically possible, since there is no limit on the number of different pages that can exist under the same domain. This is especially true for dynamically generated websites. What you could do is look for a sitemap.xml and ensure that all of those URLs are crawled/indexed by Nutch. Since the sitemap is what declares which URLs the site considers important, you can use it as a guide for what needs to be crawled.
Nutch has a sitemap processor that will inject all the URLs from the sitemap into the current crawldb (i.e., it will "schedule" those URLs to be crawled).
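If your Nutch release ships the sitemap job, the invocation looks roughly like the line below; sitemap_urls is an assumed directory containing a text file that lists the sitemap.xml URLs, and the exact flag names vary between releases, so check the usage printed by bin/nutch sitemap for your version:
bin/nutch sitemap crawl/crawldb -sitemapUrls sitemap_urls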
As a hint, even Google enforces a maximum number of URLs to be indexed from the same domain when doing a deep crawl. This is usually referred to as a crawl budget.
For example, if my website contains 10 URLs in total, the first crawl fetches all of them; the second crawl should fetch only the URLs/pages that have changed and skip the rest. Can Nutch use sitemaps to determine which pages have changed and crawl only those?
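Whether sitemaps can drive this depends on your Nutch version, but the "recrawl only what changed" goal is what Nutch's adaptive fetch schedule is for: pages whose content signature is unchanged get a progressively longer refetch interval, while pages that changed get a shorter one. A hedged sketch for conf/nutch-site.xml (the two rate values mirror the usual defaults and are shown only for illustration):
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <!-- grow the interval when a page did not change -->
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
</property>
<property>
  <!-- shrink the interval when a page did change -->
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
</property>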
I have an issue where I try to issue a new crawl over something I've already crawled, but with some new URLs.
So first I have
urls/urls.txt -> www.somewebsite.com
I then issue the command
bin/nutch crawl urls -dir crawl -depth 60 -threads 50
I then update urls/urls.txt -> remove www.somewebsite.com -> add www.anotherwebsite.com
I issue the commands
bin/nutch inject crawl urls
bin/nutch crawl urls -dir crawl -depth 60 -threads 50
What I would expect here is that www.anotherwebsite.com is injected into the existing 'crawl' db, and when the crawl is issued again it should only crawl the new website I've added, www.anotherwebsite.com (as the refetch interval for the original is set to 30 days).
What I have experienced is that either:
1) no website is crawled, or
2) only the original website is crawled.
'Sometimes', if I leave it for a few hours, it starts working, picks up the new website, and crawls both the old website and the new one (even though the refetch time is set to 30 days).
It's very weird and unpredictable behaviour.
I'm pretty sure my regex-urlfilter file is set correctly, and my nutch-site / nutch-default configs are set up with defaults (near enough).
Questions:
Can anyone explain simply (with commands) what happens during each crawl, and how to update an existing crawldb with some new URLs?
Can anyone explain (with commands) how I can force a recrawl of 'all' URLs in the crawldb? I have run readdb and checked the refetch times, and most are set to a month out, but what if I want to refetch sooner?
The article here explains the crawl process in sufficient depth.
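In case that link goes stale, here is a hedged sketch of the relevant commands; crawl/crawldb, crawl/segments and urls are example paths matching the -dir crawl layout used above, and -adddays 31 is an arbitrary value. Note that inject's first argument is the crawldb directory itself, so with -dir crawl that is crawl/crawldb rather than crawl, which may be why the newly injected URLs appeared to be ignored:
# add the new seeds to the existing crawldb
bin/nutch inject crawl/crawldb urls
# each crawl round is generate -> fetch -> parse -> updatedb; generate only selects
# URLs whose fetch time is due, so to refetch sooner than the stored interval shift
# the due-date check forward, e.g. by 31 days so month-old entries become due now
bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 31
Fetch, parse and updatedb the resulting segment as usual to complete the round.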
I crawl a site with Nutch 1.4, and I see that Nutch doesn't crawl all the links on this site. I have no filter and no limit rule on the crawl. For example, Nutch never crawls this link:
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/
Even if I give this link to Nutch directly, it never crawls it. The site is in Farsi, not English.
How can I crawl this link?
Nutch runs URL normalization and other URL-processing steps on each URL before adding it to the crawldb. Your URL might have been filtered out there. You can remove those plugins from the list of plugins used (the plugin.includes property in conf/nutch-site.xml) and try again.
Another reason the non-English URL might fail to fetch is a mismatch between the URL encoding used by the web server at www.irna.ir and the encoding used by the Nutch client.
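One way to see whether the URL is being dropped by a filter or rewritten by a normalizer is to run it through Nutch's checker classes, which read URLs from standard input. A sketch, where URL is a placeholder for the full problem URL:
URL='http://www.irna.ir/...'   # paste the complete problem URL here
echo "$URL" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
echo "$URL" | bin/nutch org.apache.nutch.net.URLNormalizerChecker
If the filter checker rejects the URL, the culprit is in the filter configuration; if the normalizer rewrites it into something the server does not recognise, that points at normalization or encoding.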
I use Nutch and I would like to index an intranet, but how can I make sure everything on the intranet will be indexed?
Thanks.
If you know all the URLs of the intranet, then write a sitemap (or an equivalent page listing all of them) and point the crawler to it.
If you don't, then you can never be sure that you have crawled all the URLs, because you cannot verify it after the crawl.
In that case, the best you can do is run the crawl at maximum depth.
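For example (a sketch; intranet.example.com is a placeholder hostname), restrict the crawl to the intranet host in conf/regex-urlfilter.txt:
+^https?://intranet\.example\.com/
-.
and then run the crawl with a generous depth and topN, e.g.:
bin/nutch crawl urls -dir crawl -depth 10 -topN 50000
The final "-." line rejects everything that did not match an earlier rule, which keeps the crawler from wandering off the intranet.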
Regards