How can I limit Nutch 2.x to fetch only N (5-10) pages from each host or domain? I can't figure it out from the config files.
I've found an option called generate.max.count.
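If generate.max.count is the right knob, a minimal sketch of the overrides in conf/nutch-site.xml might look like this (both properties exist in the stock nutch-default.xml; note the limit applies to each generated fetch list, not to the crawl as a whole):

<!-- cap the number of URLs taken from each domain per generated fetch list (sketch) -->
<property>
  <name>generate.max.count</name>
  <value>10</value>
  <description>At most 10 URLs per host/domain in each fetch list (-1 = unlimited).</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
  <description>Apply the limit per domain rather than the default per host.</description>
</property>

With a single generate/fetch cycle this effectively caps the crawl at 10 pages per domain; with more cycles the per-domain total grows accordingly.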
I am crawling a website using Apache Nutch. While crawling, I want Nutch to ignore multiple URL patterns such as http://www.youtube.com/..., http://www.twitter.com/..., etc.
I know how to configure the regex-urlfilter.txt file to crawl specific URLs.
But I don't know how to configure Nutch to ignore certain URL patterns.
I followed the following URL and found many useful examples:
https://scm.thm.de/pharus/nutch-config/blobs/66fba7d3dc015974b5c194e7ba49da60fe3c3199/Nutch-Config/conf/regex-urlfilter.txt
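For what it's worth, a sketch of what the exclusion rules could look like in conf/regex-urlfilter.txt (rules are evaluated top-down and the first match wins; '-' rejects a URL, '+' accepts it; the hosts below are just the ones named in the question):

# reject the social-media hosts entirely
-^https?://(www\.)?youtube\.com/
-^https?://(www\.)?twitter\.com/
# accept everything that was not rejected above
+.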
I crawl a site with Nutch 1.4, and I see that Nutch doesn't crawl all the links on this site. I have no filter and no limit rule for crawling. For example, Nutch never crawls this link:
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/
Even if I give this link to Nutch directly, it never crawls it. The site is in Farsi, not English.
How can I crawl this link?
Nutch runs URL normalization and other URL-processing plugins on each URL before adding it to the crawldb. Your URL might have been filtered out at that stage. You can remove those plugins from the list of plugins used (the plugin.includes property in conf/nutch-site.xml) and try again.
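For example, an override in conf/nutch-site.xml along these lines drops the URL normalizer plugins; the value shown is only an illustration based on a typical 1.4 default, so copy the actual plugin.includes value from your own nutch-default.xml and remove the urlnormalizer entries from it:

<property>
  <name>plugin.includes</name>
  <!-- illustrative: a typical default list with urlnormalizer-(pass|regex|basic) removed -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic</value>
</property>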
One reason it might fail to fetch the non-English URL is a mismatch between the URL encoding used by the web server at www.irna.ir and the encoding used by the Nutch client.
I crawl sites with Nutch 1.3. Now I want to delete a URL from the crawldb; how can I do this? And how can I read from the crawldb? I want to see the URLs it contains.
To read from the crawldb you can use the CrawlDbReader class (org.apache.nutch.crawl package). To delete/remove a URL from the crawldb, try the CrawlDbMerger class (same package) with the "-filter" option. But I suggest writing a MapReduce job to delete URLs according to your needs.
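Both classes are also reachable through the bin/nutch front end, so a command-line sketch (the crawldb paths and the example URL/filter rule are placeholders) would be:

# inspect the crawldb (CrawlDbReader)
bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb_dump
bin/nutch readdb crawl/crawldb -url http://example.com/page-to-check

# remove URLs (CrawlDbMerger with -filter): first add a rejection rule such as
#   -^http://example\.com/unwanted
# to conf/regex-urlfilter.txt, then merge into a fresh crawldb with filtering enabled
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter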
I have installed Apache Solr and have manually run cron twice, but I have a problem: 0% of the content was sent to the server:
The search index is generated by running cron. 0% of the site content has been sent to the server. There are 2884 items left to send.
Using schema.xml version: drupal-1.1
The server has a 2 min. delay before updates are processed.
Number of documents in index: 220
Number of pending deletions: 0
All messages seem to be OK:
* Apache Solr: Your site has contacted the Apache Solr server.
* Apache Solr PHP Client Library: Correct version "Revision: 22".
I replaced solrconfig.xml and schema.xml in /solr/example/solr/conf with those from the Apache Solr Drupal module.
Could somebody give me advice on what I should check?
Regards
I've just fixed the same problem. It was with the sites/default/files folder and its permissions.
Try setting 775 on the files/ folder and so on; I suspect that on /admin/reports/status you'll see lots of warnings about misconfigured permissions.
After resolving this, Solr indexed the content properly.
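In practice the fix above amounts to something like the following, run from the Drupal root (www-data is an assumption for the web-server user/group; substitute whatever your server actually runs as):

chown -R www-data:www-data sites/default/files
chmod -R 775 sites/default/files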
Some sites have a URL pattern like www.___.com/id=1 to www.___.com/id=1000. How can I crawl such a site using Nutch? Is there any way to provide seeds for fetching in a range?
I think the easiest way would be to have a script generate your initial list of URLs.
No. You have to inject them manually or generate them with a script.
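There is no range syntax for seeds, but generating and injecting them is a few lines of shell; a sketch for a Nutch 1.x layout (example.com stands in for the real domain, and the seeds/ and crawl/crawldb paths are placeholders):

# write the 1000 seed URLs and inject them into the crawldb
mkdir -p seeds
for i in $(seq 1 1000); do echo "http://www.example.com/id=$i"; done > seeds/urls.txt
bin/nutch inject crawl/crawldb seeds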