I crawl sites with Nutch 1.3. Now I want to delete a URL from the crawldb; how can I do this? And how can I read from the crawldb? I want to see the URLs that exist in it.
To read from the crawldb you can use the CrawlDbReader class (org.apache.nutch.crawl package). To delete/remove a URL from the crawldb you can try the CrawlDbMerger class (same package) with the "-filter" option. But I suggest writing a MapReduce job to delete URLs according to your needs.
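For example, via the command-line wrappers for those classes (paths such as crawl/crawldb and the output directories are placeholders for your own layout):

    # dump the crawldb to text files so you can see every URL it holds
    bin/nutch readdb crawl/crawldb -dump crawldb_dump
    # print summary statistics, or look up the status of a single URL
    bin/nutch readdb crawl/crawldb -stats
    bin/nutch readdb crawl/crawldb -url http://www.example.com/page.html

To delete URLs this way, first add a "-" rule matching them to conf/regex-urlfilter.txt, then rewrite the crawldb through the configured filters:

    # writes a filtered copy of the crawldb to crawldb_filtered
    bin/nutch mergedb crawldb_filtered crawl/crawldb -filter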
I am crawling a website using Apache Nutch. While crawling, I want Nutch to ignore multiple URL patterns such as http://www.youtube.com/..., http://www.twitter.com/..., etc.
I know how to configure the regex-urlfilter.txt file to crawl specific URLs.
But I don't know how to configure Nutch to ignore certain URL patterns.
I followed the URL below and found many useful examples:
https://scm.thm.de/pharus/nutch-config/blobs/66fba7d3dc015974b5c194e7ba49da60fe3c3199/Nutch-Config/conf/regex-urlfilter.txt
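For instance, a minimal regex-urlfilter.txt along those lines might look like this (the exact patterns are assumptions; adjust them to the URL forms you actually see):

    # rules are applied top-down and the first match wins:
    # "-" skips a URL, "+" accepts it
    -^https?://(www\.)?youtube\.com/
    -^https?://(www\.)?twitter\.com/
    # accept everything else
    +.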
I performed a successful crawl with url-1 in seed.txt and could see the crawled data in the MySQL database. When I then tried to perform another fresh crawl, replacing url-1 with url-2 in seed.txt, the new crawl started with the fetching step, and the URLs it was trying to fetch belonged to the old URL I had replaced in seed.txt. I am not sure where it picked up the old URL.
I checked for hidden seed files and didn't find any; there is only the one folder urls/seed.txt in NUTCH_HOME/runtime/local, where I run my crawl command. Please advise what the issue might be.
Your crawl database contains a list of URLs to crawl. Unless you delete the original crawl directory or create a new one as part of your new crawl, the original list of URLs will be used and extended with the new URL.
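A minimal sketch of a truly fresh 1.x crawl, assuming your previous run wrote to ./crawl (adjust paths, depth, and topN to your setup; if your data lives in MySQL via Gora, the equivalent is typically clearing the webpage table):

    rm -rf crawl    # discard the old crawldb and segments
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000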
I crawl a site with Nutch 1.4, and I have noticed that Nutch doesn't crawl all the links on this site. I have no filter and no limit rule for crawling. For example, Nutch never crawls this link:
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/
Even if I give this link to Nutch directly, it never crawls it. The site is in Farsi, not English.
How can I crawl this link?
Nutch runs URL normalization and other URL processing on each URL before adding it to the crawldb. Your URL might have been filtered out at that stage. You can remove those plugins from the list of plugins used (the plugin.includes property in conf/nutch-site.xml) and try again.
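If your version ships the checker tool, you can test whether the URL survives the configured filters before blaming the fetcher (invocation per the Nutch 1.x class name; it prints each stdin URL prefixed with "+" for accepted or "-" for rejected):

    echo 'http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/' \
      | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined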
One reason it might fail to fetch the non-English URL is a mismatch between the URL encoding used by the web server at www.irna.ir and the one used by the Nutch client.
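One way to test the fetch of that single URL in isolation is the parsechecker command, if your release includes it; if the raw form fails, try the percent-encoded UTF-8 form of the path, since that is what many servers expect:

    # fetches and parses one URL without touching the crawldb
    bin/nutch parsechecker 'http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/'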
Some sites have URL patterns running from www.___.com/id=1 to www.___.com/id=1000. How can I crawl such a site using Nutch? Is there any way to provide seeds for fetching over a range?
I think the easiest way would be to have a script generate your initial list of URLs.
No, you have to inject them manually or by using a script, as in the sketch below.
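A sketch of such a script, with www.example.com standing in for the real host:

    # generate one seed URL per id, then inject them all into the crawldb
    mkdir -p urls
    for i in $(seq 1 1000); do
      echo "http://www.example.com/id=$i"
    done > urls/seed.txt
    bin/nutch inject crawl/crawldb urls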
Does Nutch crawl automatically when I add new pages to the website?
No, you have to recrawl or create the index from scratch.
It won't re-crawl automatically. You can do either of these (see the sketch after this list):
- Make the parent page of the new URL get re-crawled, so that the new URL enters the crawldb and is fetched in a subsequent fetch round.
- Add the new URL directly to the crawldb via the inject command.
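A minimal sketch of the second option, assuming the standard 1.x layout under ./crawl (the URL and paths are placeholders):

    # inject the new URL, then run one generate/fetch/parse/updatedb round
    echo 'http://www.example.com/new-page.html' > new_urls/seed.txt
    bin/nutch inject crawl/crawldb new_urls
    bin/nutch generate crawl/crawldb crawl/segments
    segment=$(ls -d crawl/segments/* | tail -1)   # the segment just generated
    bin/nutch fetch "$segment"
    bin/nutch parse "$segment"
    bin/nutch updatedb crawl/crawldb "$segment"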
You should schedule recurring crawls to keep your data up to date; an open-source Java job scheduler, or plain cron, can drive this.
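A plain cron entry is often enough, where recrawl.sh is whatever script wraps your usual generate/fetch/parse/updatedb cycle (both the path and the script name are placeholders):

    # re-run the crawl every night at 02:00
    0 2 * * * cd /opt/nutch && ./recrawl.sh >> logs/recrawl.log 2>&1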