I have an issue where I try to issue a new crawl on something I've already crawled, but with some new URLs.
So first I have:
urls/urls.txt -> www.somewebsite.com
I then issue the command:
bin/nutch crawl urls -dir crawl -depth 60 -threads 50
I then update urls/urls.txt -> remove www.somewebsite.com -> add www.anotherwebsite.com
I then issue the commands:
bin/nutch inject crawl urls
bin/nutch crawl urls -dir crawl -depth 60 -threads 50
What I would expect here is that www.anotherwebsite.com is injected into the existing 'crawl' db, and that when crawl is issued again it only crawls the new website I've added, www.anotherwebsite.com (as the refetch interval for the original is set to 30 days).
What I have experienced is that either
1.) no website is crawled
2.) only the original website is crawled
'Sometimes', if I leave it for a few hours, it starts working: it picks up the new website and crawls both the old website and the new one (even though the refetch time is set to 30 days).
It's very weird and unpredictable behaviour.
I'm pretty sure my regex-urlfilter.txt file is set correctly, and my nutch-site.xml / nutch-default.xml are set up with defaults (near enough).
Questions:
Can anyone explain simply (with commands) what is happening during each crawl, and how to update an existing crawl db with some new URLs?
Can anyone explain (with commands) how I can force a recrawl of 'all' URLs in the crawl db? I have issued a readdb and checked the refetch times, and most are set to a month ahead, but what if I want to refetch again sooner?
The article here explains the crawl process in sufficient depth.
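To make that concrete, here is a minimal sketch of what the all-in-one crawl command does per round in Nutch 1.x, and how new URLs get into an existing CrawlDb (paths assume the layout from the question, i.e. everything under crawl/):

# 1. inject: add seeds from urls/ into the CrawlDb (only genuinely new URLs are added; existing entries keep their status and refetch time)
bin/nutch inject crawl/crawldb urls
# 2. generate: select the URLs that are currently due for fetching into a new segment
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=`ls -d crawl/segments/2* | tail -1`
# 3. fetch and parse that segment
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
# 4. updatedb: merge status, next-fetch times and newly discovered outlinks back into the CrawlDb
bin/nutch updatedb crawl/crawldb $SEGMENT

The -depth argument of the crawl command just repeats steps 2-4 that many times. For forcing a recrawl before the stored fetch interval has expired, one option I know of is the generator's -adddays switch (again a sketch, assuming the same crawl/ layout):

# inspect the current refetch times first
bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb_dump
# treat "now" as 30 days in the future, so URLs fetched within the last month become due again
bin/nutch generate crawl/crawldb crawl/segments -adddays 30

Alternatively, db.fetch.interval.default in nutch-site.xml controls the default refetch interval (30 days out of the box).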
Related
How should I work with Nutch if I am getting new URLs daily: how do I crawl the new URLs and store them in the CrawlDb? I am new to Nutch, please tell me the approach.
New URLs can be added to Nutch's CrawlDb at any time using the inject command. The newly added URLs are then fetched and processed in the next generate-fetch-update cycle.
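A minimal example of that, assuming the crawl data lives under crawl/ (as in the question above) and the new seed URLs are in a plain text file inside urls/:

bin/nutch inject crawl/crawldb urls

Inject only adds URLs that are not already in the CrawlDb, and existing entries keep their status, so the next generate run will pick up the new URLs as unfetched and schedule them alongside whatever else is due.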
For example, if my website contains 10 URLs in total, in my first crawl I crawl all of them; when I crawl a second time, it should crawl only the URLs/pages that have changed and not crawl the other pages. Can Nutch use sitemaps to determine which pages have changed and crawl only those?
I am using apache-nutch-1.6 and I can successfully crawl websites.
My problem is that not all entries in the seed.txt file are used; it depends on which sites are inside. Is there a limit anywhere on how much is crawled? There is no error message. If I delete one site, another site is crawled deeply, whereas if the first one is present, that one is crawled and the other sites only get their top pages, I believe.
Configure this correctly:
bin/nutch crawl $URLS -dir $CRAWL_LOC -depth 10 -topN 1000
depth: Nutch will crawl up to this many levels deep (link hops from the seeds)
topN: at each level, Nutch will crawl at most this number of URLs
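If seeds seem to be missing, it is also worth checking whether they actually made it into the CrawlDb and whether they were simply never selected because topN was too small for that round. A sketch, assuming the same $CRAWL_LOC as above:

# overall counts: how many URLs are unfetched, fetched, gone, etc.
bin/nutch readdb $CRAWL_LOC/crawldb -stats
# dump the CrawlDb to text to look for the missing seeds by hand
bin/nutch readdb $CRAWL_LOC/crawldb -dump $CRAWL_LOC/crawldb_dump

Also make sure regex-urlfilter.txt is not silently dropping some of the seeds.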
I performed a successful crawl with url-1 in seed.txt and I could see the crawled data in the MySQL database. When I then tried to perform another fresh crawl by replacing url-1 with url-2 in seed.txt, the new crawl started with the fetching step, and the URLs it was trying to fetch belonged to the old, replaced URL from seed.txt. I am not sure where it picked up the old URL from.
I checked for hidden seed files and didn't find any; there is only one seed file, urls/seed.txt, in NUTCH_HOME/runtime/local, where I run my crawl command. Please advise what the issue might be.
Your crawl database contains a list of URLs to crawl. Unless you delete the original crawl directory or create a new one as part of your new crawl, the original list of URLs will be used and extended with the new URL.
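So for a genuinely fresh crawl of url-2, either remove the old crawl data or point the crawl at a new location. A sketch for the directory-based 1.x layout (if your storage is an external database such as MySQL, you would have to empty that instead; directory names here are just placeholders):

# option A: throw away the old crawl data (destructive) and start over
rm -rf crawl
bin/nutch crawl urls -dir crawl -depth 10 -topN 1000
# option B: keep the old data and write the new crawl to a separate directory
bin/nutch crawl urls -dir crawl_url2 -depth 10 -topN 1000

The point is that -dir decides which CrawlDb is reused.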
Does Nutch crawl automatically when I add new pages to the website?
No, you have to recrawl or create the index from scratch.
It won't re-crawl automatically. You can do either of these:
Have the parent page of the new URL re-crawled, so that the new URL enters the CrawlDb and is fetched in a subsequent fetch round.
Add the new URL directly to the CrawlDb via the inject command (see the sketch below).
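A short sketch of the second option, assuming the crawl data sits under crawl/ and the new URL is in a seed file inside a directory called new_urls/ (names are placeholders):

# add the new URL to the existing CrawlDb
bin/nutch inject crawl/crawldb new_urls
# it will then be selected in the next generate/fetch/parse/updatedb round
bin/nutch generate crawl/crawldb crawl/segments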
You should do scheduled crawling to keep your data up-to-date.
Open source Java Job Schedulers
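If a full Java scheduler is more than you need, a plain cron entry is often enough. A sketch, assuming Nutch is installed under /opt/nutch and the layout used in the answers above (add it via crontab -e):

# recrawl every night at 02:00 and append the output to a log file
0 2 * * * cd /opt/nutch && bin/nutch crawl urls -dir crawl -depth 3 -topN 1000 >> /var/log/nutch-crawl.log 2>&1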