I have 2 requirements.
The first is that I want Nutch to index only pages that contain certain words in the HTML. For example, I only want Nutch to index pages whose HTML contains the word "wonderful".
The second is that I want Nutch to index only certain URLs from the site. For example, I want Nutch to index URLs of the form "mywebsite.com/XXXX/ABC/XXXX" or "mywebsite.com/grow.php/ABC/XXXX", where "XXXX" can be any word of any length.
This is the content of my seed.txt file:
http://mysite.org/
This is the content of my regex-urlfilter.txt:
+^http://mysite.org/work/.*?/text/
I have commented out
#+.
But I am still getting the error below:
crawl started in: crawl
rootUrlDir = bin/urls
threads = 10
depth = 3
solrUrl=http://localhost:8983/solr/
topN = 5
Injector: starting at 2013-07-09 11:05:51
Injector: crawlDb: crawl/crawldb
Injector: urlDir: bin/urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-09 11:06:08, elapsed: 00:00:17
Generator: starting at 2013-07-09 11:06:08
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
Start here to set up your desired URL pattern. Then look into plugins to parse your content and decide what should be indexed.
It shows that the Injector rejects the URL in your seed file:
Injector: total number of urls rejected by filters: 1
Either your regex is not working, or some other pattern in the file rejects your URL, such as -.*(/[^/]+)/[^/]+\1/[^/]+\1/ or -[?*!#=].
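For the URL shapes you describe, patterns along these lines in regex-urlfilter.txt might be a starting point (just a sketch, untested; adjust the host and path pieces to your real site):
# accept mywebsite.com/XXXX/ABC/XXXX and mywebsite.com/grow.php/ABC/XXXX
+^http://(www\.)?mywebsite\.com/[^/]+/ABC/[^/]+$
+^http://(www\.)?mywebsite\.com/grow\.php/ABC/[^/]+$
# reject everything else
-.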
I know this is pretty old, but I just wanted to add my two cents on the topic of the crawling vs. indexing filter, for Nutch 1.13.
regex-urlfilter testing
If you want to test your regex-urlfilter.txt expressions, you can run the filter plugin directly like this:
$ bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
This gives no prompt, but if you type URLs and press Enter, you'll see each one echoed back with a '-' or '+' prefix, telling you whether the URL passes the configured filter. For example:
http://aaa.com
-http://aaa.com
http://bbb.com
+http://bbb.com
if the config is something like:
+^http://bbb.com\.*
-.*
crawling filter vs. index filter
This is not well documented, and it took me a while to find a clue.
If we want different filtering precision (broad for crawling, but more detailed for indexing), we can do the following.
First, if we're using the bin/crawl script, just add a -filter option at the end of the index command, plus the parameter that specifies the regex file to use (-Durlfilter.regex.file), like this:
< __bin_nutch index $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
> __bin_nutch index $JAVA_PROPERTIES -Durlfilter.regex.file=regex-urlfilter-index.txt "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT -filter
Otherwise, if you're calling bin/nutch index without the crawl script, just append both parameters to that command.
And now, enter the desired configuration in the 'regex-urlfilter-index.txt' file.
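For instance, a stricter index-time filter could look something like this (the host and path here are just placeholders for whatever you actually want indexed):
# index only article pages, drop everything else crawled
+^https?://www\.example\.com/articles/.+
-.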
Thanks to Arthur's question on Grokbase for the insight:
http://grokbase.com/t/nutch/user/1579evs40h/filtering-at-index-time-with-a-different-regex-urlfilter-txt-from-crawl
I'm wondering if you can use wildcard characters with tags to get all tagged scenarios/features that match a certain pattern.
For example, I've used 17 unique tags on many scenarios throughout many of my feature files. The pattern is "#jira=CIS-" followed by 4 numbers, like #jira=CIS-1234 and #jira=CIS-5678.
I'm hoping I can use a wildcard character or something that will find all of the matches for me.
I want to be able to exclude them from being run, when I run all of my features/scenarios.
I've tried the following:
--tags ~#jira
--tags ~#jira*
--tags ~#jira=*
--tags ~#jira=
Unfortunately, none of these have given me the results I wanted. I was only able to exclude them when I used the exact tag, e.g. ~#jira=CIS-1234. It's not a good solution to have to add each single one (of the 17 different tags) to the command line. These tags can change frequently, with new ones being added and old ones being removed, plus it would make for one really long command.
Yes. First read this: there is an undocumented expression language (based on JS) for advanced tag selection based on the #key=val1,val2 form: https://stackoverflow.com/a/67219165/143475
So you should be able to do this:
valuesFor('#jira').isPresent
And even this (here s will be a string, on which you can even use JS regex if you know how):
valuesFor('#jira').isEach(s => s.startsWith('CIS-'))
It would be great to get your confirmation; then this thread itself can help others, and we can add it to the docs at some point.
~/runtime/local/bin/urls/seed.txt >>
http://nutch.apache.org/
~/runtime/local/conf/nutch-site.xml >>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
<property>
<name>http.timeout</name>
<value>99999999</value>
<description></description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|
scoring-opic|urlnormalizer-(pass|regex|basic)|index-more
</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin.
</description>
</property>
</configuration>
~/runtime/local/conf/regex-urlfilter.txt >>
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)*
When I crawl, it says this:
/home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch crawl urls -dir newCrawl/ -depth 3 -topN 3
cygpath: can't convert empty path
solrUrl is not set, indexing will be skipped...
crawl started in: newCrawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 3
Injector: starting at 2014-07-18 11:35:36
Injector: crawlDb: newCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2014-07-18 11:35:39, elapsed: 00:00:02
Generator: starting at 2014-07-18 11:35:39
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl
No matter what the web addresses are, it always says there are no URLs to fetch.
I have been struggling with this problem for 3 days. Please help!
I was looking at your regex filter and I spotted a few glitches that you might want to try fixing. Since this won't fit well into a comment, I will post it here, even though it might not be the complete answer.
Your customized regular expression +^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)* might be the problem. Nutch's regex-urlfilter can get really confusing at times, and I would highly recommend you start with something that works for everyone, maybe +^http://([a-z0-9]*\.)*nutch.apache.org/ from the Wiki, just to get started.
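In other words, as a sanity check you could temporarily reduce regex-urlfilter.txt to something minimal like this (a sketch based on the Wiki example, not your final filter; note that the standard skip rule excludes file:, ftp: and mailto:, not http:):
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# accept everything under nutch.apache.org
+^http://([a-z0-9]*\.)*nutch.apache.org/
# reject anything else
-.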
Once you've done that and are sure Nutch is working, you can tweak the regex.
To test the regex, I found two ways to do it:
Feed a list of URLs as the seed, inject them into a new crawl database, and see which ones get injected or rejected (see the sketch after this list). This doesn't really require any coding.
You can set up Nutch in Eclipse and call the corresponding class to test it.
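For the first approach, the commands might look roughly like this (a sketch, assuming a directory test-urls containing a seed.txt with the URLs you want to check, and a throwaway crawl db):
bin/nutch inject test-crawl/crawldb test-urls
bin/nutch readdb test-crawl/crawldb -stats
The readdb -stats output then shows how many URLs actually made it into the crawl db after filtering.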
Can anybody help me configure Sphinx to best match URLs (or parts of URLs) in HTML content?
My config:
index base_index
{
docinfo = extern
mlock = 0
morphology = none
min_word_len = 3
charset_type = utf-8
charset_table = 0..9, A..Z->a..z, a..z
enable_star = 1
blend_chars = _, -, #, /, .
html_strip = 0
}
I use SphinxAPI on the backend (PHP) with SPH_MATCH_EXTENDED mode.
I don't understand how the search works. If I search for "domain.com" I get 37 results; if I search for "www.domain.com", 643 results. But why? "domain.com" is a substring of "www.domain.com", so in theory the first query should return more results.
FreeBSD 9.2, Sphinx 2.1.2
16 distributed indexes (147 GB)
This is a bit late, but here are my thoughts anyway.
It looks like when you search for www.domain.com, Sphinx is actually looking for www, domain, and com separately. If you search for just domain.com, it's looking for domain and com. This is probably why www.domain.com returns more results: www appears more frequently throughout the index.
Since you're searching URLs, I would set up stopwords depending on how you want to search. For me, I would make www, com, org, and basically all top-level domains stopwords. You might want to leave the top-level domains and just make www a stopword; that would allow a com to be weighted higher than a net in a search result.
If you set up your stopwords right, when someone searches for domain.com, Sphinx just looks for hits of domain in the index, whether that's domain.com, domain.org, or domain.net.
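As a rough sketch, that could look like this in the index definition (the stopwords path and the file contents below are just an illustration):
index base_index
{
    # your existing settings (docinfo, charset_table, blend_chars, ...)
    stopwords = /usr/local/etc/sphinx/url-stopwords.txt
}
where url-stopwords.txt simply lists one word per line:
www
com
org
net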
The Nutch wiki suggests the use of the following:
bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
What is the purpose of
[-filter] [-normalize]
when Nutch has numerous filter and normalization configuration files?
automaton-urlfilter.txt
domain-urlfilter.txt
regex-urlfilter.txt
suffix-urlfilter.txt
regex-normalize.xml
host-urlnormalizer.txt
When indexing to Solr, filtering and normalization are disabled by default, so if you want the documents you are passing to Solr to be normalized or filtered, you enable these options.
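For example, based on the usage line above, an index run that applies both filtering and normalization would look something like this (the Solr URL and paths are just placeholders):
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -filter -normalize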
To me it seems like a pointless option, but only because that is not how I would like my Solr configuration to work; it is a more advanced feature that will benefit a small number of people.
I started using Apache Nutch (v1.5.1) to index all the websites under certain domains.
There is a huge number of websites (on the order of millions) in my domains, and I need to index them step by step instead of waiting for the whole process to finish.
I found something in the Nutch wiki (http://wiki.apache.org/nutch/NutchTutorial/#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling) that should work. The idea is to make a script which calls every single step of the process (generate, fetch, parse, ...) on a certain amount of data (for example 1000 URLs) cyclically.
bin/nutch inject crawl/crawldb crawl/seed.txt
bin/nutch generate crawl/crawldb crawl/segments -topN 25
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch generate crawl/crawldb crawl/segments -topN 25
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
...
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
My question is: is there any way to specify this setup directly in Nutch and have it do this work in a parallel and more transparent way, for example in separate threads?
Thanks for answering.
UPDATE
I tried to create the script (the code is above), but unfortunately I get an error during the invertlinks phase. This is the output:
LinkDb: starting at 2012-07-30 11:04:58
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/apache-nutch-1.5-bin/crawl/segments/20120730102927
LinkDb: adding segment: file:/home/apache-nutch-1.5-bin/crawl/segments/20120704094625
...
LinkDb: adding segment: file:/home/apache-nutch-1.5-bin/crawl/segments/20120704095730
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/apache-nutch-1.5-bin/crawl/segments/20120730102927/parse_data
Input path does not exist:
file:/home/apache-nutch-1.5-bin/crawl/segments/20120704094625/parse_data
...
Thanks for your help.
(If I had enough rep I would post this as a comment).
Remember that the -depth switch refers to EACH CRAWL, and not the overall depth of the site it will crawl. That means that a second run with depth=1 will descend one MORE level from the already-indexed data and stop after topN documents.
So, if you aren't in a hurry to fully populate the data: I've had a lot of success in a similar situation by performing a large number of repeated shallow nutch crawl runs (using smallish -depth (3-5) and -topN (100-200) values) from a large seed list. This ensures that only (depth * topN) pages get indexed in each batch, and the index will start delivering URLs within a few minutes.
Then, I typically set up the crawl to fire off every (1.5 * average initial crawl time) seconds and let it rip. Understandably, at only 1,000 documents per crawl it can take a lot of time to get through a large infrastructure, and (with indexing, the paused time, and other overhead) the method can multiply the time it takes to crawl the whole stack.
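As a very rough sketch, the repeated shallow crawls can be driven by something like this (paths, the depth/topN values, and the sleep interval are placeholders you would tune as described above):
while true; do
  # one shallow crawl pass from the large seed list
  bin/nutch crawl urls -dir crawl -depth 3 -topN 200
  # pause roughly 1.5x your average crawl time before the next pass
  sleep 600
done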
The first few times through the infrastructure it's a pretty bad slog. As the adaptive crawling algorithm starts to kick in, however, and the recrawl times start to approach reasonable values, the package starts really delivering.
(This is somewhat similar to the "whole web crawling" method you mention from the Nutch wiki, which advises you to break the data into 1,000-page segments, but it is much more terse and understandable for a beginner.)