Using Nutch to crawl a specified URL list

I have a list of one million URLs to fetch. I use this list as Nutch seeds and run the basic Nutch crawl command to fetch them. However, I find that Nutch automatically fetches URLs that are not on the list. I set the crawl parameters to -depth 1 -topN 1000000, but it does not work. Does anyone know how to do this?

Set this property in nutch-site.xml (by default it is true, so outlinks are added to the CrawlDb):
<property>
<name>db.update.additions.allowed</name>
<value>false</value>
<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.
</description>
</property>

Delete the crawl and urls directories (if created before).
Create and update the seed file (where URLs are listed, one URL per row).
Restart the crawling process.
Command
nutch crawl urllist -dir crawl -depth 3 -topN 1000000
urllist - directory where the seed file (URL list) is present
crawl - output directory name
If the problem still persists, try deleting your Nutch crawl directories and restarting the whole process; a sketch of the full sequence is below.
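Putting it together, a minimal end-to-end sketch (directory names are the ones used above; the example.com URLs are placeholders):
mkdir urllist
echo "http://www.example.com/page1" > urllist/seed.txt
echo "http://www.example.com/page2" >> urllist/seed.txt
# with db.update.additions.allowed=false in nutch-site.xml, updatedb will not
# add newly discovered outlinks, so only the seeded URLs stay in the CrawlDb
nutch crawl urllist -dir crawl -depth 3 -topN 1000000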

Related

nutch redirection handling issue

I am a bit new to Nutch. I am crawling a URL which redirects to another URL. When analysing my crawl results, I get the content of the first URL along with a status of "temp redirected to (second URL)". My question is: why am I not getting the content and details of that second URL? Is the redirected URL getting crawled or not? Please help.
Again, in the omnipresent nutch-default.xml, there is a property that controls how Nutch handles redirection.
<property>
<name>http.redirect.max</name>
<value>0</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>
As the description says, with the default value of 0 the fetcher won't immediately follow redirected URLs; it only records them for later fetching. I still have not figured out how to force the URLs marked db_redir_temp to be fetched. However, if you change this configuration right at the beginning, I assume your problem will probably go away.
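For example, overriding it in nutch-site.xml with a positive value (3 here is an arbitrary choice) should make the fetcher follow redirects immediately instead of just recording them:
<property>
<name>http.redirect.max</name>
<value>3</value>
<description>Follow up to 3 redirects immediately instead of recording them
for later fetching.</description>
</property>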
In Nutch 2.3.1, I set the following property in my nutch-site.xml and it helped me fetch the redirected URL on the next attempt. It may be helpful to someone trying this on Nutch 2.3.1.
<property>
<name>db.fetch.interval.default</name>
<value>0</value>
<description>The default number of seconds between re-fetches of a page (30 days).
</description>
</property>
In Nutch 2.3.1 there is a method called getProtocolOutput in the class
org.apache.nutch.protocol.http.api.HttpBase
Inside this method there is a call to another method:
Response response = getResponse(u, page, false); (Line 250)
Change the false to true in that call, as this flag refers to followRedirects.
Then recompile the Nutch classes, and following redirects will work correctly :)
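In other words, the edited call would look roughly like this (the exact line number may differ between releases):
// in org.apache.nutch.protocol.http.api.HttpBase#getProtocolOutput
// the third argument is followRedirects; true makes the fetcher follow the redirect itself
Response response = getResponse(u, page, true);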

How to re-crawl with Nutch

I am using Nutch 2.1 integrated with MySQL. I crawled 2 sites and Nutch successfully crawled them and stored the data in MySQL. I am using Solr 4.0.0 for searching.
Now my problem is that when I try to re-crawl a site like trailer.apple.com or any other site, it always crawls the previously crawled URLs, even though I have removed those URLs from the seeds.txt file and entered new URLs. Nutch is not crawling the new URLs.
Can anybody tell me what I am actually doing wrong?
Also, please suggest any Nutch plugin that can help with crawling video and movie sites.
Any help would be really appreciated.
I have the same problem. Nutch re-crawls only the old URLs, even though they no longer exist in seed.txt.
The first time I start Nutch, I do the following:
Add the domain "www.domain01.com" to /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt (without quotes)
In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt, add a new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt, add a new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
... and everything was fine.
Next I made the following changes:
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and add two new domains: www.domain02.com and www.domain03.com
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Next I execute the following commands:
updatedb
bin/nutch inject urls
bin/nutch generate urls
bin/nutch updatedb
bin/nutch crawl urls -depth 3
And Nutch still crawls www.domain01.com.
I don't know why.
I use Nutch 2.1 on Debian Linux 6.0.5 (x64), running in a virtual machine on Windows 7 (x64).
This post is a bit outdated but still valid for the most part: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ Perhaps the last crawled pages are the ones that change the most. Nutch uses an adaptive algorithm to schedule re-crawls, so when a page is very static it should not be re-crawled very often. You can override how often you want to re-crawl using nutch-site.xml (see the sketch after this answer). Also, the seed.txt file is only a seed list; once you inject the URLs, Nutch does not use it anymore (unless you manually re-inject it).
Another configuration that may help is your regex-urlfilter.txt, if you want to point to a specific place or exclude certain domains/pages, etc.
Cheers.
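As a sketch of that override, lowering db.fetch.interval.default in nutch-site.xml makes pages become due for re-fetching sooner (86400 seconds, i.e. one day, is just an example value; the default is 30 days):
<property>
<name>db.fetch.interval.default</name>
<value>86400</value>
<description>Re-fetch pages after one day instead of the default 30 days.</description>
</property>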
Just add the property below to your nutch-site.xml. It works for me; check it.
<property>
<name>file.crawl.parent</name>
<value>false</value>
</property>
Then change regex-urlfilter.txt to:
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
# accept anything else
+.
After that, remove the index directory, either manually or with a command like
rm -r $NUTCH_HOME/indexdir
and then run your crawl command again.
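Putting those steps together, the clean-up and re-crawl might look something like this (directory names such as indexdir, urls and crawl are just the ones used above):
# remove the old index so stale results are not served from it
rm -r $NUTCH_HOME/indexdir
# re-inject the updated seed list and crawl again
bin/nutch inject urls
bin/nutch crawl urls -dir crawl -depth 3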

On which criteria does Nutch select topN docs while crawling?

On which criteria does Nutch select the topN docs while crawling? And how does Nutch create segments?
Here are the things that are taken into account (see the sketch after this list):
The score of the URL
How many URLs belonging to the same host are allowed to be crawled
Whether the re-fetch time of the URL has been reached
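These checks happen during the generate step, where -topN caps how many of the highest-scoring, due-for-fetch URLs go into the next segment. A Nutch 1.x style invocation might look like this (paths are examples):
# select at most 1000 of the top-scoring URLs that are due for fetching into a new segment
bin/nutch generate crawl/crawldb crawl/segments -topN 1000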

how to make nutch crawler crawl

I have some doubts about Nutch.
Following the wiki, I am asked to edit crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*apache.org/
I am also asked to create a url folder with a list of URLs.
Do I need to list all the links both in crawl-urlfilter.txt and in the list of URLs?
Yes and no.
crawl-urlfilter.txt acts as a filter, so in your example only URLs on apache.org will ever be crawled.
The url folder gives the 'seed' URLs where the crawler starts.
So if you want the crawler to stay within a set of sites, you will want to make sure they have a positive match with the filter... otherwise it will crawl the entire web. This may mean you have to put the list of sites in the filter; see the sketch below.
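A minimal sketch of how the two pieces fit together (the seed URL is just an example):
urls/seed.txt - one start URL per line:
http://nutch.apache.org/
crawl-urlfilter.txt - only URLs matching a + rule get crawled:
# allow anything on apache.org
+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
-.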

Nutch issues with crawling a website where the URLs differ only in terms of parameters passed

I am using Nutch to crawl websites and, strangely, for one of my websites the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other.
The URLs on my website are basically of this format:
http://mysite.com/index.php?main_page=index&params=12
http://mysite.com/index.php?main_page=index&category=tub&param=17
i.e. the URLs differ only in terms of parameters appended to the URL (the part "http://mysite.com/index.php?" is common to all URLs).
Is Nutch unable to crawl such websites?
What Nutch settings do I need in order to crawl such websites?
I got the issue fixed.
It had everything to do with the URL filter set as:
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
I commented out this filter and Nutch crawled all URLs :)
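For reference, the fix amounts to commenting out that rule in conf/regex-urlfilter.txt so URLs with query strings are no longer skipped:
# skip URLs containing certain characters as probable queries, etc.
# (commented out so URLs like http://mysite.com/index.php?main_page=index&params=12 are accepted)
#-[?*!#=]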
