How to make Apache Nutch index while crawling

I started using Apache Nutch (v1.5.1) to index all the websites under certain domains.
There is a huge number of websites (on the order of millions) in my domains and I need to index them step by step instead of waiting for the end of the whole process.
I found something in the Nutch wiki that should work (see http://wiki.apache.org/nutch/NutchTutorial/#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling). The idea is to make a script which calls every single step of my process (generate, fetch, parse, ...) on a certain amount of data (for example 1000 URLs) cyclically.
bin/nutch inject crawl/crawldb crawl/seed.txt
bin/nutch generate crawl/crawldb crawl/segments -topN 25
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch generate crawl/crawldb crawl/segments -topN 25
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
...
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
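For what it's worth, the same steps can be wrapped in a loop instead of being copied for every iteration. A minimal sketch, assuming a batch size of 1000 and a fixed number of iterations (both values are placeholders to adjust):
bin/nutch inject crawl/crawldb crawl/seed.txt
# repeat generate/fetch/parse/updatedb for a fixed number of batches (10 is a placeholder)
for i in $(seq 1 10); do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    segment=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $segment
    bin/nutch parse $segment
    bin/nutch updatedb crawl/crawldb $segment
done
# build the linkdb and the index once the batches are done
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*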
My question is: is there any way to specify this setting directly in Nutch and have it do this work in a parallel and more transparent way, for example in separate threads?
Thanks for answering.
UPDATE
I tried to create the script (the code is above) but unfortunately I get an error in the invertlinks phase. This is the output:
LinkDb: starting at 2012-07-30 11:04:58
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/apache-nutch-1.5-bin/crawl/segments/20120730102927
LinkDb: adding segment: file:/home/apache-nutch-1.5-bin/crawl/segments/20120704094625
...
LinkDb: adding segment: file:/home/apache-nutch-1.5-bin/crawl/segments/20120704095730
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/apache-nutch-1.5-bin/crawl/segments/20120730102927/parse_data
Input path does not exist:
file:/home/apache-nutch-1.5-bin/crawl/segments/20120704094625/parse_data
...
Thanks for your help.

(If I had enough rep I would post this as a comment).
Remember that the -depth switch refers to EACH CRAWL, and not the overall depth of the site it will crawl. That means that a second run with depth=1 will descend one MORE level from the already indexed data and stop at topN documents.
So, if you aren't in a hurry to fully populate the data: I've had a lot of success in a similar situation by performing a large number of repeated shallow nutch crawl commands (using smallish -depth (3-5) and -topN (100-200) values) from a large seed list. This ensures that only (depth * topN) pages get indexed in each batch, and the index will start delivering URLs within a few minutes.
Then, I typically set up the crawl to fire off every (1.5 * average initial crawl time) seconds and let it rip. Understandably, at only 1,000 documents per crawl it can take a lot of time to get through a large infrastructure, and (after indexing, the paused time and other overhead) the method can multiply the time needed to crawl the whole stack.
The first few times through the infrastructure, it's a pretty bad slog. As the adaptive crawling algorithm starts to kick in, however, and the recrawl times start to approach reasonable values, the package really starts delivering.
(This is somewhat similar to the "whole web crawling" method you mention in the nutch wiki, which advises you to break the data into 1,000 page segments, but much more terse and understandable for a beginner.)
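A minimal sketch of that repeated-shallow-crawl setup, assuming the seed directory is urls, the crawl directory is crawl, and a 15-minute pause between runs (all of these are placeholders):
# each run fetches at most depth * topN pages, then we pause and go again
while true; do
    bin/nutch crawl urls -dir crawl -depth 3 -topN 200
    sleep 900   # roughly 1.5x the average duration of one run (placeholder)
done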

Related

Nutch says No URLs to fetch - check your seed list and URL filters

~/runtime/local/bin/urls/seed.txt >>
http://nutch.apache.org/
~/runtime/local/conf/nutch-site.xml >>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
<property>
<name>http.timeout</name>
<value>99999999</value>
<description></description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|
scoring-opic|urlnormalizer-(pass|regex|basic)|index-more
</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin.
</description>
</property>
</configuration>
~/runtime/local/conf/regex-urlfilter.txt >>
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)*
When I crawl, the output looks like this:
/home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch crawl urls -dir newCrawl/ -depth 3 -topN 3
cygpath: can't convert empty path
solrUrl is not set, indexing will be skipped...
crawl started in: newCrawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 3
Injector: starting at 2014-07-18 11:35:36
Injector: crawlDb: newCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2014-07-18 11:35:39, elapsed: 00:00:02
Generator: starting at 2014-07-18 11:35:39
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl
No matter what the web addresses are, it always says there are no URLs to fetch.
I have been struggling with this problem for 3 days. Please help!
I was looking at your regex filter and spotted a few glitches that you might want to try fixing. Since this won't fit well into a comment, I will post it here even though it might not be the complete answer.
Your customized regular expression +^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)* might be the problem. Nutch's regex-urlfilter can sometimes get really confusing, and I would highly recommend you start with something that works for everyone, maybe +^http://([a-z0-9]*\.)*nutch.apache.org/ from the wiki, just to get started.
After the steps above, once you are sure Nutch is working, you can tweak the regex.
To test the regex, I found two ways to do it:
Feed a list of URLs as the seed, inject them into a new database, and see which have been injected or rejected. This doesn't really require any coding.
You can set up Nutch in Eclipse and call the corresponding class to test it.
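For the first approach, a minimal sketch might look like this, assuming a throwaway CrawlDB under /tmp (the paths are placeholders); the injector log plus the readdb stats show how many URLs survived filtering:
# inject the seed list into a scratch CrawlDB, then dump its statistics
bin/nutch inject /tmp/testcrawl/crawldb urls
bin/nutch readdb /tmp/testcrawl/crawldb -stats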

When do I use solrindex [-filter] and [-normalize]?

In the Nutch wiki it suggests use of the following:
bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
What is the purpose of
[-filter] [-normalize]
when Nutch has numerous filter and normalization configuration files?
automaton-urlfilter.txt
domain-urlfilter.txt
regex-urlfilter.txt
suffix-urlfilter.txt
regex-normalize.xml
host-urlnormalizer.txt
When indexing to Solr, these options are disabled by default, so if you wish the documents you are passing to Solr to be normalized or filtered, you would enable them.
To me it seems like a pointless option, but only because that is not how I would like my Solr configuration to work; it is a more advanced feature that will benefit a small number of people.
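For example, a solrindex call with both switches enabled might look like this (the Solr URL and the paths are assumptions):
# filter and normalize URLs again at index time, using the configured urlfilter/urlnormalizer plugins
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/* -filter -normalize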

How to make nutch index only pages with certain text?

I have 2 requirements.
The first is that I want Nutch to index only pages that contain certain words in the HTML. For example, I only want Nutch to index pages that contain the word "wounderful" in the HTML.
The second is that I want Nutch to index only certain URLs from the site. For example, I want Nutch to index URLs that look like "mywebsite.com/XXXX/ABC/XXXX" or "mywebsite.com/grow.php/ABC/XXXX", where "XXXX" can be any word of any length.
This is the content of my seed.txt file
http://mysite.org/
this is the content of my regex-urlfilter.txt
+^http://mysite.org/work/.*?/text/
I have commented out
#+.
But I am still getting the error below:
crawl started in: crawl
rootUrlDir = bin/urls
threads = 10
depth = 3
solrUrl=http://localhost:8983/solr/
topN = 5
Injector: starting at 2013-07-09 11:05:51
Injector: crawlDb: crawl/crawldb
Injector: urlDir: bin/urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-09 11:06:08, elapsed: 00:00:17
Generator: starting at 2013-07-09 11:06:08
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
Start here to set up your desired URL pattern. Then look into plugins to parse your content and decide what should be indexed.
The log shows that the Injector rejected the URL in your seed file:
Injector: total number of urls rejected by filters: 1
Either your regex is not working, or there are other patterns that reject your URL, such as -.*(/[^/]+)/[^/]+\1/[^/]+\1/ or -[?*!#=]
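As an aside, for the URL-pattern half of the requirement, a regex-urlfilter.txt sketch along these lines might match the two shapes described in the question (the optional www. prefix and the trailing-segment rules are assumptions; remember the first matching rule wins, so the catch-all reject must come last):
# accept mywebsite.com/XXXX/ABC/XXXX (host prefix handling is an assumption)
+^http://(www\.)?mywebsite\.com/[^/]+/ABC/[^/]+
# accept mywebsite.com/grow.php/ABC/XXXX
+^http://(www\.)?mywebsite\.com/grow\.php/ABC/[^/]+
# reject everything else
-.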
I know this is pretty old, but I just wanted to add my two cents on the topic of the crawling vs. indexing filter, for Nutch 1.13.
regex-urlfilter testing
If you want to test your regex-urlfilter.txt expressions, you can invoke the plugin directly like this:
$ bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
This prints no prompt, but if you type URLs and press enter, you'll see each one echoed back with a '-' or '+' prefix, telling you whether the URL passes the configured filters,
such as
http://aaa.com
-http://aaa.com
http://bbb.com
+http://bbb.com
if config is something like
+^http://bbb.com\.*
-.*
crawling filter vs. index filter
This is not well documented, and it took me a while to find a clue.
If we want different filtering precision (broad for crawling, but more detailed for indexing), we can do the following.
First, if we're using the bin/crawl script, just add a -filter option at the end of the indexing command, plus the param that specifies the regex file to use (-Durlfilter.regex.file), like this:
< __bin_nutch index $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
> __bin_nutch index $JAVA_PROPERTIES -Durlfilter.regex.file=regex-urlfilter-index.txt "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT -filter
Otherwise, just append both parameters to the bin/nutch index command if you're using it without the crawl script.
And now, enter the desired configuration in the 'regex-urlfilter-index.txt' file.
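If you're running the indexer by hand, the standalone equivalent might look something like this (the paths and the regex file name are assumptions):
# use a dedicated regex file at index time and re-apply URL filtering
bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt crawl/crawldb -linkdb crawl/linkdb crawl/segments/* -filter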
Thanks to Arthur's question on grokbase for the insight:
http://grokbase.com/t/nutch/user/1579evs40h/filtering-at-index-time-with-a-different-regex-urlfilter-txt-from-crawl

How can I get the date of the last Full Crawl in Sharepoint 2010 using Powershell?

I can run the following to get the current crawls and from there determine the last crawl completed date.
# Get the Search App from Sharepoint
$searchApp = Get-SPEnterpriseSearchServiceApplication "My Search Service"
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $searchapp
$contentsource = Get-SPEnterpriseSearchCrawlContentSource "MyCrawl" -SearchApplication $searchApp
$contentsource.CrawlCompleted
But this is the last time any crawl completed. I want the date of the last Full crawl.
I can see the information in the crawl history. But when I try to get the crawl history (see http://blogs.msdn.com/b/carloshm/archive/2009/03/31/how-to-programmatically-export-the-crawl-history-to-a-csv-file-in-powershell.aspx) using the code below, I don't seem to get an object I can really work with (it's one big string container as far as I can tell), and it is full of IDs.
$s = new-Object Microsoft.SharePoint.SPSite("http://portal");
$c = [Microsoft.Office.Server.Search.Administration.SearchContext]::GetContext($s);
$h = new-Object Microsoft.Office.Server.Search.Administration.CrawlHistory($c)
I was hoping to get an object that represents the crawl history which I could then filter on crawl name and Type = full.
I have searched around and can't find an answer anywhere. (Note also that the CrawlHistory class is being deprecated).
Any thoughts/suggestions?
You're close. You just need to call $h.GetCrawlHistory() and parse it.
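A rough sketch of the parsing step, assuming GetCrawlHistory() returns a table whose columns include CrawlType, ContentSourceId and EndTime (those column names, and CrawlType = 1 meaning a full crawl, are assumptions from memory; inspect $h.GetCrawlHistory().Columns in your farm first):
$history = $h.GetCrawlHistory()
# keep only full crawls of the content source we care about, newest first
# (column names and the CrawlType value are assumptions - verify against your environment)
$lastFull = $history.Rows |
    Where-Object { $_.ContentSourceId -eq $contentsource.Id -and $_.CrawlType -eq 1 } |
    Sort-Object EndTime -Descending |
    Select-Object -First 1
$lastFull.EndTime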

Compare two websites and see if they are "equal?"

We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as the old server. I was just wondering if anyone knew of anything to assist in this task?
Get the formatted output of both sites (here we use w3m, but lynx can also work):
w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html
Then use wdiff, it can give you a percentage of how similar the two texts are.
wdiff -nis /tmp/1.html /tmp/2.html
It can be also easier to see the differences using colordiff.
wdiff -nis /tmp/1.html /tmp/2.html | colordiff
Excerpt of output:
Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion
Google [hp1] [hp2]
[hp3] [-Français-] {+Deutschland+}
[ ] Recherche
avancéeOutils
[Recherche Google][J'ai de la chance] linguistiques
/tmp/1.html: 43 words 39 90% common 3 6% deleted 1 2% changed
/tmp/2.html: 49 words 39 79% common 9 18% inserted 1 2% changed
(it actually put google.com into French... funny)
The common % values are how similar both texts are. Plus you can easily see the differences by word (instead of by line which can be a clutter).
The catch is how to check the 'rendered' pages. If the pages don't have any dynamic content, the easiest way is to generate hashes for the files using the md5sum or sha1sum commands and check them against the new server.
If the pages have dynamic content, you will have to download the site using a tool like wget
wget --mirror http://thewebsite/thepages
and then use diff as suggested by Warner, or do the hash thing again. I think diff may be the best way to go, since even a change of 1 character will mess up the hash.
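For the static case, a minimal sketch of the hashing idea, assuming both trees have already been copied or mirrored to local directories (the paths are placeholders):
# hash every file in each tree, then compare the two sorted listings
(cd /tmp/old-site && find . -type f -exec md5sum {} + | sort) > /tmp/old.md5
(cd /tmp/new-site && find . -type f -exec md5sum {} + | sort) > /tmp/new.md5
diff /tmp/old.md5 /tmp/new.md5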
I've created the following PHP code that does what Weboide suggests here. Thanks Weboide!
the paste is here:
http://pastebin.com/0V7sVNEq
Using the open source tool recheck-web (https://github.com/retest/recheck-web), there are two possibilities:
Create a Selenium test that checks all of your URLs on the old server, creating Golden Masters, then run that test against the new server and see how they differ.
Use the free and open source Chrome extension (https://github.com/retest/recheck-web-chrome-extension), which internally uses recheck-web to do the same: https://chrome.google.com/webstore/detail/recheck-web-demo/ifbcdobnjihilgldbjeomakdaejhplii
For both solutions you currently need to manually list all relevant URLs. In most situations this shouldn't be a big problem. recheck-web will compare the rendered websites and show you exactly where they differ (e.g. different font, different meta tags, even different link URLs), and it gives you powerful filters to let you focus on what is relevant to you.
Disclaimer: I have helped create recheck-web.
Copy the files to the same server in /tmp/directory1 and /tmp/directory2 and run the following command:
diff -r /tmp/directory1 /tmp/directory2
For all intents and purposes, you can put them in your preferred location with your preferred naming convention.
Edit 1
You could potentially use lynx -dump or wget and run a diff on the results.
Short of rendering each page, taking screen captures, and comparing those screenshots, I don't think it's possible to compare the rendered pages.
However, it is certainly possible to compare the downloaded website after downloading recursively with wget.
wget [option]... [URL]...
-m
--mirror
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP
directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
The next step would then be to do the recursive diff that Warner recommended.
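Putting the two steps together, a sketch might look like this (host names and directories are placeholders; -nH drops the hostname directory level so the two trees line up for diff):
# mirror both servers into separate local directories, then compare recursively
wget --mirror -nH -P /tmp/old-server http://old.example.com/
wget --mirror -nH -P /tmp/new-server http://new.example.com/
diff -r /tmp/old-server /tmp/new-server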
