On which criteria does Nutch select the topN documents while crawling? And how does Nutch create segments?
Here are the things that are taken into account (a configuration sketch follows the list):
the score of the URL
how many URLs belonging to the same host are allowed to be crawled
whether the re-fetch time of the URL has been reached
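A minimal sketch of where these criteria live, assuming a stock Nutch 1.x setup (the values are purely illustrative; the property names come from nutch-default.xml):

<!-- nutch-site.xml: cap how many URLs per host go into one fetch list -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<!-- default re-fetch interval in seconds (30 days) -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
</property>

The topN cutoff itself is applied by the generator, which sorts the eligible URLs by score and writes the best N into a new segment, e.g.:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000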
I am crawling, for example, 1000 websites. When I run readdb, some websites show db_redirect_temp and db_redirect_moved. If I set http.redirect.max=10, is this value applied per website, or does it allow only 10 redirects for the entire crawl?
http.redirect.max is defined as:
The maximum number of redirects the fetcher will follow when trying to fetch a page. If set to negative or 0, fetcher won't immediately follow redirected URLs, instead it will record them for later fetching.
The number applies to the redirects of a single web page. 10 is a really generous limit; 3 should be enough in most cases, given that the redirect target will be tried in one of the later fetch cycles anyway. Note that the redirect source is always recorded in the CrawlDb as db_redir_perm or db_redir_temp.
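For illustration, a lower per-page limit can be set in nutch-site.xml (the value here is just an example):

<property>
  <name>http.redirect.max</name>
  <!-- follow at most 3 redirects per fetched page; 0 or negative defers redirect targets to a later fetch cycle -->
  <value>3</value>
</property>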
I have upgraded my application from Nutch 1.3 to Nutch 1.11. Previously, while crawling with Nutch 1.3, I used to get two URLs, example.com/ and example.com/index.html.
But after the upgrade I get only one of the two. I want to confirm: is the upgraded Nutch smart enough to detect this?
Nutch 1.11 will crawl and index both example.com and example.com/index.html, given that:
1. both are included in the seeds or reachable via links from one of the seeds
2. URL normalization or filter rules accept both and do not normalize one into the other
3. they are not duplicates (identical content)
4. both of them are real pages and not redirects
Regarding 2: there is a rule in regex-normalize.xml which does the described normalization. By default it's not active (commented out):
<!-- changes default pages into standard for /index.html, etc. into /
<regex>
<pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&|#|$)</pattern>
<substitution>/$3</substitution>
</regex> -->
Regarding 3: deduplication has been significantly improved in Nutch 1.8; it is no longer an operation on the index but flags duplicates directly in the CrawlDb. However, you should see in the logs that both URLs are fetched; deduplication happens later, based on the checksum (signature) of the fetched content.
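As a sketch of that workflow (the paths are placeholders, and an indexer backend such as Solr is assumed to be configured), duplicates are first flagged in the CrawlDb and then removed from the index by the cleaning job:

# flag duplicates (db_duplicate) in the CrawlDb based on content signatures
bin/nutch dedup crawl/crawldb
# delete documents marked as duplicates (and gone pages) from the configured index
bin/nutch clean crawl/crawldb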
I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filter in crawl-urlfilter.txt so it looks like this now:
# skip urls with these characters
#-[]
#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
So I think I've effectively removed any filter, and I'm telling Nutch to accept all URLs it finds on my website.
Does anyone have any suggestions? Or is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue I'm having? Or am I doing something wrong?
See my previous question here: Adding URL parameter to Nutch/Solr index and search results
The first 'Edit' there should answer your question.
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
You have to comment it out or modify it as:
# skip URLs containing certain characters as probable queries, etc.
-[*!#]
By default, crawlers skip links with query strings to avoid spam and endless dynamically generated pages (e.g. search result pages).
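If you want to check how a URL is handled before running a full crawl, Nutch ships a small checker class that runs URLs from stdin through the active filter chain (a sketch; the exact class and options may differ between versions, so verify against your bin/nutch):

echo "http://example.com/page?param=1" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
# output prefixed with '+' means the URL is accepted, '-' means it is rejected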
I have a few doubts about Nutch.
While following the wiki I am asked to edit crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*apache.org/
and I am asked to create a url folder with a list of URLs...
Do I need to put all the links both in crawl-urlfilter.txt and in the list of URLs?
Yes and no.
crawl-urlfilter.txt acts as a filter, so in your example only URLs on apache.org will ever be crawled.
The url folder gives the 'seed' URLs where the crawler starts.
So if you want the crawler to stay within a set of sites, you will want to make sure they have a positive match with the filter... otherwise it will crawl the entire web. This may mean you have to put the list of sites in the filter as well.
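A minimal sketch of that setup, assuming the old bin/nutch crawl front end from the 1.x tutorial (the directory names and seed URL are placeholders):

# seed list: where the crawler starts
mkdir urls
echo "http://nutch.apache.org/" > urls/seed.txt

# crawl-urlfilter.txt: keep the crawl inside apache.org
# +^http://([a-z0-9]*\.)*apache.org/

# crawl from the seed folder, 3 link hops deep, at most 50 URLs per round
bin/nutch crawl urls -dir crawl -depth 3 -topN 50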
I am using Nutch to crawl websites, and strangely, for one of my websites the crawl returns only two URLs: the home page (http://mysite.com/) and one other.
The URLs on my website are basically of this format:
http://mysite.com/index.php?main_page=index&params=12
http://mysite.com/index.php?main_page=index&category=tub&param=17
i.e. the URLs differ only in the parameters appended to them (the part "http://mysite.com/index.php?" is common to all URLs).
Is Nutch unable to crawl such websites?
What Nutch settings do I need in order to crawl such websites?
I got the issue fixed.
It had everything to do with the URL filter set as:
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
I commented out this filter and Nutch crawled all URLs :)
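For reference, the change amounts to something like this in the URL filter file (a sketch; note that accepting query strings everywhere can also let the crawler into session-ID and calendar traps, so a narrower exclusion rule, like the hypothetical one below, may be safer):

# skip URLs containing certain characters as probable queries, etc.
# (commented out so URLs with ?, = and & in their query strings are accepted)
#-[?*!#=]

# hypothetical narrower rule: still skip obvious session-ID URLs
-(?i)(phpsessid|jsessionid|sessionid)=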