I am a bit new to Nutch. I am crawling a URL which redirects to another URL. When I analyse my crawl results I get the content of the first URL along with the status: temp redirected to (second url name). My question is: why am I not getting the content and details of that second URL? Is the redirected URL getting crawled or not? Please help.
Again, in the omnipotent nutch-default.xml there is a property that controls how Nutch handles redirection.
<property>
<name>http.redirect.max</name>
<value>0</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>
As the description says, with the default value the fetcher won't immediately follow redirected URLs; it records them for later fetching instead. I still have not figured out how to force the URLs in db_redir_temp to be fetched. However, if you change this configuration right at the beginning, I assume your problem might go away.
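For example, a minimal sketch of an override in nutch-site.xml (the value 2 is just an illustration) that makes the fetcher follow redirects immediately:
<property>
<name>http.redirect.max</name>
<value>2</value>
<description>Follow up to 2 redirects immediately instead of recording them for later fetching (illustrative value).</description>
</property>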
In Nutch 2.3.1, I set the following property in my nutch-site.xml and it helped me fetch the redirected URL on the next attempt. It may be helpful to someone trying this on Nutch 2.3.1.
<property>
<name>db.fetch.interval.default</name>
<value>0</value>
<description>The default number of seconds between re-fetches of a page (30 days).
</description>
</property>
In Nutch 2.3.1 there is a method called getProtocolOutput in the class
org.apache.nutch.protocol.http.api.HttpBase
Inside this method there is a call to another method:
Response response = getResponse(u, page, false); (Line 250)
Change the false to true in that call, as this flag is the followRedirects argument.
Then recompile the Nutch classes, and following redirects will work correctly :)
I am crawling, for example, 1000 websites. When I run readdb, some websites show db_redirect_temp and db_redirect_moved. If I set http.redirect.max=10, is this value applied per website, or does it allow only 10 redirects for the entire set of crawled websites?
http.redirect.max is defined as:
The maximum number of redirects the fetcher will follow when trying to fetch a page. If set to negative or 0, fetcher won't immediately follow redirected URLs, instead it will record them for later fetching.
The number applies to the redirects of a single web page. 10 is a really generous limit; 3 should be enough in most cases, given that the redirect target will be tried in one of the later fetch cycles anyway. Note that the redirect source is always recorded in the CrawlDb as db_redir_perm or db_redir_temp.
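For example, a sketch of a nutch-site.xml override using the limit suggested above:
<property>
<name>http.redirect.max</name>
<value>3</value>
<description>Follow at most 3 redirects when fetching a single page; the redirect source is still recorded in the CrawlDb.</description>
</property>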
At present my page URL structure is like this:
mysite.com/page-2
mysite.com/categoryname/page-2
I want to change it to
mysite.com/page/2/
mysite.com/category/categoryname/page/2/
Note that I have multiple pages, i.e. a category may have 100 pages, so I need to redirect all the earlier page numbers to the new URL structure. Writing a rule for every URL would be too tedious. I am not sure, but perhaps using $ in the rules can help.
Please let me know how this can be done via .htaccess.
Thanks in advance.
As far as I can see, your URL structure already matches the desired one:
/category/telecom-news/page/2
I am using Nutch 2.1 integrated with MySQL. I crawled 2 sites and Nutch successfully crawled them and stored the data in MySQL. I am using Solr 4.0.0 for searching.
Now my problem is that when I try to re-crawl a site like trailer.apple.com or any other site, it always crawls the last crawled URLs, even though I have removed the last crawled URLs from the seeds.txt file and entered the new URLs. Nutch is not crawling the new URLs.
Can anybody tell me what I am actually doing wrong?
Also, please suggest any Nutch plugin that can help with crawling video and movie sites.
Any help will be really appreciated.
I have the same problem. Nutch re-crawls only the old URLs, even though they no longer exist in seed.txt.
The first time I start Nutch I do the following:
Add the domain "www.domain01.com" to /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt (without quotes)
In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt, add new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt, add new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
... and everything was fine.
Next I made the following changes:
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and add two new domains: www.domain02.com and www.domain03.com
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Next I execute the following commands:
updatedb
bin/nutch inject urls
bin/nutch generate urls
bin/nutch updatedb
bin/nutch crawl urls -depth 3
And Nutch still crawls www.domain01.com.
I don't know why.
I use Nutch 2.1 on Debian Linux 6.0.5 (x64), running in a virtual machine on Windows 7 (x64).
This post is a bit outdated but still valid for the most part: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/. Perhaps the last crawled pages are the ones that change the most. Nutch uses an adaptive algorithm to schedule re-crawls, so when a page is very static it should not be re-crawled very often. You can override how often you want to re-crawl using nutch-site.xml. Also, the seed.txt file is supposed to be a seed list; once you inject the URLs, Nutch does not use it anymore (unless you manually re-inject it).
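For example, a sketch of a nutch-site.xml override (the one-day value is only an illustration) to shorten the default re-fetch interval:
<property>
<name>db.fetch.interval.default</name>
<value>86400</value>
<description>Re-fetch pages after one day instead of the default 30 days (illustrative value).</description>
</property>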
Another configuration that may help is your regex-urlfilter.txt, if you want to point to a specific place or exclude certain domains/pages, etc.
Cheers.
Just add the property below to your nutch-site.xml. It works for me, check it:
<property>
<name>file.crawl.parent</name>
<value>false</value>
</property>
And change regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
# accept anything else
+.
After that, remove the index directory manually or with a command like:
rm -r $NUTCH_HOME/indexdir
Then run your crawl command.
I have a list of one million URLs to fetch. I use this list as the Nutch seeds and use the basic Nutch crawl command to fetch them. However, I find that Nutch automatically fetches URLs that are not on the list. I set the crawl parameters to -depth 1 -topN 1000000, but it does not work. Does anyone know how to do this?
Set this property in nutch-site.xml (by default it is true, so outlinks are added to the CrawlDb):
<property>
<name>db.update.additions.allowed</name>
<value>false</value>
<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.
</description>
</property>
Delete the crawl and urls directory (if created before)
Create and update the seed file (where URLs are listed, one URL per row)
Restart the crawling process
Command
nutch crawl urllist -dir crawl -depth 3 -topN 1000000
urllist - Directory where seed file (url list) is present
crawl - Directory name
If the problem still persists, try deleting your Nutch folder and restarting the whole process.
On which criteria does Nutch select the topN docs while crawling? And how does Nutch create segments?
Here are the things that are taken into account:
The score of the URL
How many URLs belonging to the same host are allowed to be crawled (see the sketch after this list)
Whether the re-fetch time of the URL has been reached
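If I remember correctly, the per-host limit is controlled by the generate.max.count and generate.count.mode properties; here is a sketch of a nutch-site.xml override (the value 100 is just an illustration):
<property>
<name>generate.max.count</name>
<value>100</value>
<description>Maximum number of URLs per counting unit (see generate.count.mode) in a single fetchlist; -1 means unlimited (illustrative value).</description>
</property>
<property>
<name>generate.count.mode</name>
<value>host</value>
<description>Count URLs for generate.max.count per host.</description>
</property>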