Crawl 1000 URLs per recrawl in Nutch

Hello,
I wrote a crawl script and I need to fetch 1000 URLs per crawl session. If I use
bin/nutch fetch $s1 -threads 100 -topN 1000
it crawls more than 1000 URLs. I have no idea why this happens. Can anyone tell me how I can crawl exactly 1000 URLs per crawl session in Nutch 1.2?

Off the top of my head, you should use
bin/nutch generate ... -topN 1000
Fetch only processes the segment that generate produces, so the -topN limit belongs to the generate step; fetch simply downloads whatever generate selected.
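A minimal recrawl cycle illustrating this (the crawl/ paths and the segment-selection line are assumptions; adjust them to your layout):
# generate a fetch list capped at the 1000 top-scoring URLs
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
# pick up the segment that generate just created
s1=`ls -d crawl/segments/2* | tail -1`
# -threads controls parallelism, not volume; the 1000-URL cap came from generate
bin/nutch fetch $s1 -threads 100
bin/nutch updatedb crawl/crawldb $s1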

Related

Nutch http.redirect.max: may I know what it means?

I am crawling, for example, 1000 websites. When I run readdb, some URLs show db_redirect_temp and db_redirect_moved. If I set http.redirect.max=10, is this value applied per website, or does it allow only 10 redirects for the entire crawl?
http.redirect.max is defined as:
The maximum number of redirects the fetcher will follow when trying to fetch a page. If set to negative or 0, fetcher won't immediately follow redirected URLs, instead it will record them for later fetching.
The number applies to the redirects of a single web page. 10 is a really generous limit, 3 should be enough in most cases given that the redirect target will be tried in one of the later fetch cycles anyway. Note that the redirect source is always recorded in the CrawlDb as db_redir_perm or db_redir_temp.
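For reference, a nutch-site.xml override sketch using the suggested limit of 3 (the description text here is mine, not the shipped default):
<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>Follow up to 3 redirects per page immediately; with 0 or a
  negative value, redirects are only recorded for later fetch cycles.
  </description>
</property>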

Nutch crawls the parent URLs even though URL filters are specified

I have an issue with my crawl process. In regex-urlfilter.txt I specified the filter below:
^+(http|https)://www.abc.com/subdomain
I want to block the parent URL; I only want to crawl the pages under /subdomain. Help me out with how to block the parent URL.
Try this
+^(http|https)://www.abc.com/subdomain
-^(http|https)://www.abc.com/
-^.
You can test whether a URL is accepted or rejected with:
bin/nutch org.apache.nutch.net.URLFilterChecker -filterName urlfilter-regex
Type your URL: if the output starts with -, it was rejected; if it starts with +, it was accepted.
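A quick sketch of a checker session, assuming the three rules above are active (the +/- prefix on the echoed URL is the checker's per-line verdict):
echo "http://www.abc.com/subdomain/page.html" | bin/nutch org.apache.nutch.net.URLFilterChecker -filterName urlfilter-regex
+http://www.abc.com/subdomain/page.html
echo "http://www.abc.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -filterName urlfilter-regex
-http://www.abc.com/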

Using Nutch to crawl a specified URL list

I have a list of one million URLs to fetch. I use this list as Nutch seeds and use the basic Nutch crawl command to fetch them. However, I find that Nutch automatically fetches URLs that are not on the list. I did set the crawl parameters -depth 1 -topN 1000000, but it does not work. Does anyone know how to do this?
Set this property in nutch-site.xml (by default it is true, so updatedb adds discovered outlinks to the CrawlDb):
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>
Delete the crawl and urls directories (if created before).
Create and update the seed file (one URL per row).
Restart the crawling process.
Command:
nutch crawl urllist -dir crawl -depth 3 -topN 1000000
urllist - directory where the seed file (URL list) is present
crawl - output directory name
If the problem still persists, try deleting your Nutch folder and restarting the whole process.
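A minimal end-to-end sketch, assuming the directory names above (the example URLs are placeholders, and -depth 1 matches the goal of fetching only the listed seeds):
mkdir urllist
cat > urllist/seed.txt <<'EOF'
http://example.com/page1
http://example.com/page2
EOF
nutch crawl urllist -dir crawl -depth 1 -topN 1000000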

On which criteria does Nutch select the topN docs while crawling?

On which criteria does Nutch select the topN docs while crawling? And how does Nutch create segments?
Here are the things that are taken into account (see the sketch after this list):
The score of the URL.
How many URLs belonging to the same host are allowed to be crawled.
Whether the re-fetch time of the URL has been reached.
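As a sketch of where these knobs live: -topN caps the generated fetch list by score, each generate run writes a new timestamped segment under the segments directory (which is what fetch then operates on), and the per-host cap is a config property (generate.max.per.host in Nutch 1.x; later versions replaced it with generate.max.count). The value 50 below is an arbitrary example:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
<property>
  <name>generate.max.per.host</name>
  <value>50</value>
  <description>At most 50 URLs per host in a single fetch list.</description>
</property>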

Nutch issues with crawling a website where the URLs differ only in terms of the parameters passed

I am using Nutch to crawl websites, and strangely, for one of my websites the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other.
The URLs on my website are basically of this format:
http://mysite.com/index.php?main_page=index&params=12
http://mysite.com/index.php?main_page=index&category=tub&param=17
i.e. the URLs differ only in the parameters appended (the part "http://mysite.com/index.php?" is common to all URLs).
Is Nutch unable to crawl such websites?
Which Nutch settings do I need in order to crawl such websites?
I got the issue fixed.
It had everything to do with the URL filter that skips URLs containing certain characters as probable queries:
-[?*!#=]
I commented out this filter and Nutch crawled all URLs :)
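In conf/regex-urlfilter.txt the fix looks like this (commenting the rule out rather than deleting it, so it is easy to restore):
# skip URLs containing certain characters as probable queries, etc.
# -[?*!#=]
Be aware that disabling this rule globally also admits session IDs and other query-string URLs on every site you crawl, which can bloat the CrawlDb; adding site-specific + rules above it is the safer variant.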
