How to re-crawl with Nutch - nutch

I am using Nutch 2.1 integrated with MySQL. I have crawled two sites, and Nutch successfully crawled them and stored the data in MySQL. I am using Solr 4.0.0 for searching.
My problem is that when I try to re-crawl a site such as trailer.apple.com, or any other site, Nutch always crawls the previously crawled URLs. Even though I have removed the old URLs from the seed.txt file and added new ones, Nutch does not crawl the new URLs.
Can anybody tell me what I am actually doing wrong?
Also, please suggest a Nutch plugin that can help with crawling video and movie sites.
Any help would be really appreciated.

I have the same problem. Nutch re-crawls only the old URLs, even though they no longer exist in seed.txt.
The first time I started Nutch I did the following:
Add the domain "www.domain01.com" to /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt (without quotes)
In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt, add a new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt, add a new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
... and everything was fine.
Next I made the following changes:
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and add two new domains: www.domain02.com and www.domain03.com
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Next I executed the following commands:
updatedb
bin/nutch inject urls
bin/nutch generate urls
bin/nutch updatedb
bin/nutch crawl urls -depth 3
And Nutch still crawls www.domain01.com.
I don't know why.
I use Nutch 2.1 on Debian Linux 6.0.5 (x64), running in a virtual machine on Windows 7 (x64).

This post is a bit outdated but still valid for the most part: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/. Perhaps the last crawled pages are the ones that change the most. Nutch uses an adaptive algorithm to schedule re-crawls, so when a page is very static it should not be re-crawled very often. You can override how often you want to re-crawl using nutch-site.xml. Also, the seed.txt file is only a seed list; once you inject the URLs, Nutch does not use it anymore (unless you manually re-inject it).
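For example, a minimal nutch-site.xml sketch that shortens the re-crawl interval might look like the following; these are the standard Nutch scheduling properties, but the values shown here are only examples:
<property>
<name>db.fetch.interval.default</name>
<!-- default is 30 days (2592000 seconds); a smaller value forces earlier re-fetches -->
<value>86400</value>
</property>
<property>
<name>db.fetch.schedule.class</name>
<!-- the adaptive schedule adjusts the interval per page based on observed changes -->
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>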
Another configuration that may help is your regex-urlfilter.txt, if you want to restrict the crawl to a specific place or exclude certain domains/pages.
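As an illustration, a regex-urlfilter.txt restricted to a single site could look like this sketch (example.com is a placeholder):
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# accept only pages under example.com
+^http://([a-z0-9]*\.)*example\.com/
# reject everything else
-.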
Cheers.

Just add the property below to your nutch-site.xml; it works for me. Check it:
<property>
<name>file.crawl.parent</name>
<value>false</value>
</property>
Then change regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
# accept anything else
+.
After that, remove the index directory, either manually or with a command such as:
rm -r $NUTCH_HOME/indexdir
Then run your crawl command again.
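For completeness, the crawl can then be re-run with the one-step crawl command used elsewhere in this thread; the exact flags depend on your Nutch version, and the directory names and limits here are only examples:
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000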

Related

Are example.com/ and example.com/index.html considered the same in Nutch 1.11?

I have upgraded my application from Nutch 1.3 to Nutch 1.11. Previously, when crawling with Nutch 1.3, I used to get two URLs, example.com/ and example.com/index.html.
After upgrading I only get one of the two. I want to confirm: is the upgraded Nutch smart enough to detect this?
Nutch 1.11 will crawl and index both example.com/ and example.com/index.html, given that
both are included in the seeds or reachable via links from one of the seeds
URL normalization and filter rules accept both and do not normalize one into the other
they are not duplicates (i.e. do not have identical content)
both of them are real pages, not redirects
Regarding 2: there is a rule in regex-normalize.xml which does the described normalization. By default it's not active (commented out):
<!-- changes default pages into standard for /index.html, etc. into /
<regex>
<pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&|#|$)</pattern>
<substitution>/$3</substitution>
</regex> -->
Regarding 3: deduplication was significantly improved in Nutch 1.8; it is no longer an operation on the index but flags duplicates directly in the CrawlDb. In any case, you should see in the logs that both URLs are fetched; deduplication is done later based on the checksum of the fetched content.
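A minimal sketch of running that job from the command line, assuming a Nutch 1.8+ local install with the CrawlDb at crawl/crawldb (the path is only an example):
# mark duplicates in the CrawlDb based on the checksum of the fetched content
bin/nutch dedup crawl/crawldb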

Using Nutch to crawl a specified URL list

I have a list of one million URLs to fetch. I use this list as the Nutch seeds and run the basic Nutch crawl command to fetch them. However, I find that Nutch also fetches URLs that are not on the list. I did set the crawl parameters to -depth 1 -topN 1000000, but it does not work. Does anyone know how to do this?
Set this property in nutch-site.xml (by default it is true, so outlinks are added to the CrawlDb):
<property>
<name>db.update.additions.allowed</name>
<value>false</value>
<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.
</description>
</property>
Delete the crawl and urls directories (if created before)
Create and update the seed file (URLs listed one per row)
Restart the crawling process
Command
nutch crawl urllist -dir crawl -depth 3 -topN 1000000
urllist - directory where the seed file (URL list) is located
crawl - output directory name
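For illustration, the seed file inside the urllist directory is just a plain-text list, one URL per line; the file name and URLs below are placeholders:
# urllist/seed.txt
http://www.example.com/page-one
http://www.example.org/page-two
http://www.example.net/page-three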
If the problem persists, try deleting your Nutch folder and restarting the whole process.

Nutch 1.2 - Why won't Nutch crawl URLs with query strings?

I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filter in crawl-urlfilter.txt so it looks like this now:
# skip urls with these characters
#-[]
#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
So I think I've effectively removed every filter, telling Nutch to accept all URLs it finds on my website.
Does anyone have any suggestions? Or is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue? Or am I doing something wrong?
See my previous question here: Adding URL parameter to Nutch/Solr index and search results.
The first 'Edit' should answer your question.
# skip URLs containing certain characters as probable queries, etc.
#-[?*!#=]
You have to comment it out or modify it as:
# skip URLs containing certain characters as probable queries, etc.
-[*!#]
By default, crawlers skip links with query strings to avoid crawler traps and endless dynamically generated pages.
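Putting it together, the relevant part of crawl-urlfilter.txt could look like this sketch ('?' and '=' are removed from the skip rule so query-string URLs pass through; mysite.com is a placeholder):
# skip URLs containing certain characters as probable queries, etc.
-[*!#]
# accept pages on the target site, including those with query strings
+^http://([a-z0-9]*\.)*mysite\.com/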

How to make the Nutch crawler crawl

I have some doubts about Nutch.
Following the wiki, I am asked to edit crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*apache.org/
and I am asked to create a urls folder containing a list of URLs.
Do I need to put all the links in both crawl-urlfilter.txt and the list of URLs?
Yes and no.
crawl-urlfilter.txt acts as a filter, so in your example only URLs on apache.org will ever be crawled.
The urls folder gives the 'seed' URLs where the crawler starts.
So if you want the crawler to stay within a set of sites, you have to make sure they match a positive rule in the filter; otherwise it will crawl the entire web. This may mean you have to put the list of sites in the filter, as in the sketch below.
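A minimal sketch of the two files for a crawl restricted to two sites (the domain names are placeholders):
# urls/seed.txt - where the crawler starts
http://www.site-one.com/
http://www.site-two.com/
# crawl-urlfilter.txt - keep the crawler on those two sites
+^http://([a-z0-9]*\.)*site-one\.com/
+^http://([a-z0-9]*\.)*site-two\.com/
# reject everything else
-.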

Nutch issues with crawling a website where the URLs differ only in terms of the parameters passed

I am using Nutch to crawl websites, and strangely, for one of my websites, the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other.
The URLs on my website are basically of this format:
http://mysite.com/index.php?main_page=index&params=12
http://mysite.com/index.php?main_page=index&category=tub&param=17
i.e. the URLs differ only in terms of the parameters appended to the URL (the part "http://mysite.com/index.php?" is common to all URLs).
Is Nutch unable to crawl such websites?
What Nutch settings do I need in order to crawl such websites?
I got the issue fixed.
It had everything to do with the URL filter set as:
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
I commented out this filter and Nutch crawled all the URLs :)
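In other words, the relevant part of regex-urlfilter.txt ends up looking like this sketch:
# skip URLs containing certain characters as probable queries, etc.
# (commented out so that URLs differing only in query parameters are crawled)
#-[?*!#=]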
