Sentences as documents in Nutch - search

I need Nutch to split web pages into sentences when saving the crawl results. The reason is so that Solr sees each sentence as a document when indexing.
The result I need is to be able to do a search for, say, "one word" and get a list of all sentences that contain "one" and/or "word".
I'm new to Nutch so some pointers would really be useful...
Should I look into Nutch configuration files?
Do I need to change Nutch source code?
Or can I write a separate app that can edit the crawl results once Nutch is done crawling?

Yes, you can use Nutch for this task.
1) Configuration files alone will not do the job for you; see the points below.
2) You'd need to write your own parser plugin that hooks into the Nutch parsing phase after the crawl, splits your HTML page into sentences, and returns N results from a single page. This is quite unusual, as normally one page is one result. Check out the FeedParser plugin to see how to return multiple results from one page.
3) In principle, you could iterate over the pages fetched by Nutch, get the text, split it into sentences, and use the Solr API to index the sentences as if they were documents (see the sketch below). This could quite easily be done as a MapReduce job.
As a general reference, I suggest you have a look at this article on splitting text into sentences:
http://sujitpal.blogspot.com/2011/04/uima-sentence-annotator-using-opennlp.html
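A minimal sketch of option 3, assuming OpenNLP (for sentence detection, as in the article above) and SolrJ 6+ are on the classpath; the core name "sentences" and the field names "id", "url" and "text" are illustrative, not part of any standard schema:

import java.io.FileInputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SentenceIndexer {
    public static void main(String[] args) throws Exception {
        // Pre-trained English sentence model shipped by the OpenNLP project
        SentenceModel model = new SentenceModel(new FileInputStream("en-sent.bin"));
        SentenceDetectorME detector = new SentenceDetectorME(model);

        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/sentences").build();

        // In a real job these would come from the Nutch segments; hard-coded here for brevity
        String url = "http://example.com/page";
        String pageText = "First sentence. Second sentence. Third sentence.";

        String[] sentences = detector.sentDetect(pageText);
        for (int i = 0; i < sentences.length; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", url + "#" + i);   // one Solr document per sentence
            doc.addField("url", url);
            doc.addField("text", sentences[i]);
            solr.add(doc);
        }
        solr.commit();
        solr.close();
    }
}

Since every sentence gets its own document ID derived from the page URL, a search for "one word" then matches individual sentences rather than whole pages.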

Related

How can I get the tags of stackoverflow for solr index?

Recently I used Nutch 1.11 and Solr 4.10.4 to set up a crawler. I can crawl data with the sequential Nutch commands, but my problem now is how to fetch specific data, like the tags of Stack Overflow questions, so that I can use it for Solr indexing for my own purposes. I tried to configure and modify "local/conf/nutch-site" but it doesn't work for me. I'm new to Nutch!
Nutch fetches URLs, so what you could do is point it to a page that contains all the links to the questions with that tag.
For example,
https://stackoverflow.com/questions/tagged/nutch?sort=newest
contains links to all questions tagged with Nutch. Crawling for 2 or more rounds will then make Nutch fetch all the outlinks from this page.
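If the goal is just to pull the tags out of each fetched question page, one option, separate from Nutch's own parsing, is to run the HTML through jsoup. A rough sketch, assuming Stack Overflow still renders tags as anchors with the post-tag CSS class (worth verifying against the live markup) and using a made-up question URL:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TagExtractor {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://stackoverflow.com/questions/12345678")
                            .userAgent("Mozilla")
                            .timeout(5000)
                            .get();
        // Tag links carry the "post-tag" class in the current markup
        for (Element tag : doc.select("a.post-tag")) {
            System.out.println(tag.text());   // e.g. "nutch", "solr"
        }
    }
}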

From a pool of webpages, finding pages similar to any given webpage

I am given a set of webpages and I need to build a page recommender. Whichever URL is given to the application, the application should be able to find out pages from the given pool that are similar to the page at the URL.
I tried looking at different approaches to do that. The use of word2vec interested me. I am planning to crawl through all the given webpages and generate tags for each page based on its content. From these tags I was hoping to use word2vec to calculate a vector value for the page and store it. When searching, I would calculate the vector for the given page in a similar way and look for similar values. Is this the correct way of using word2vec? What training vectors should be used? Is there a better way to do this task, or would plain text matching be a better option?
I'd recommend using an existing open-source IR engine to handle your documents, i.e. to index your crawled webpages and to query them for results.
You can index all the webpages with Elasticsearch and query them using the More Like This query. From the Elasticsearch documentation:
The More Like This Query (MLT Query) finds documents that are "like" a given set of documents
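A rough sketch of such a query with the Elasticsearch 7.x Java high-level REST client; the index name "webpages", the "content" field and the document id are illustrative assumptions, not anything mandated by Elasticsearch:

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.MoreLikeThisQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SimilarPages {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // Find pages "like" the already-indexed page with id "42",
        // scored on the text of the "content" field
        MoreLikeThisQueryBuilder mlt = QueryBuilders.moreLikeThisQuery(
                new String[]{"content"},
                null,
                new MoreLikeThisQueryBuilder.Item[]{
                        new MoreLikeThisQueryBuilder.Item("webpages", "42")})
                .minTermFreq(1)
                .minDocFreq(1);

        SearchRequest request = new SearchRequest("webpages");
        request.source(new SearchSourceBuilder().query(mlt));

        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        for (SearchHit hit : response.getHits().getHits()) {
            System.out.println(hit.getId() + " score=" + hit.getScore());
        }
        client.close();
    }
}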

How to parse a document using crawler4j

I wanted to parse all the documents containing some text I enter as "query" using crawler4j in Eclipse.
Any ideas?
Not really a "direct" answer, but I have also been playing with crawling these last few days. I looked first at Crawler4J, then stumbled on jsoup. I did not play much with the crawler, but jsoup turns out to be quite an easy tool for parsing, hence my suggestion. I guess the crawler is good if you really need to crawl part of the web, but jsoup really seems to shine as a parser: it is similar to jQuery in terms of selecting nodes, etc. So perhaps use the crawler to collect documents first, then parse them with jsoup. Here's a quick example:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").timeout(5000).get();
Elements els = doc.select("li"); // all <li> elements on the page
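To make the suggested split concrete, here is a rough sketch of crawler4j collecting pages and jsoup checking each one for the query text. The class name, seed domain and query string are illustrative, and the shouldVisit/visit signatures are those of crawler4j 4.x:

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class QueryCrawler extends WebCrawler {

    private static final String QUERY = "some text";  // the term to look for

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Keep the crawl on a single site; adjust to taste
        return url.getURL().startsWith("http://example.com/");
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml();
            Document doc = Jsoup.parse(html);          // hand the raw HTML to jsoup
            if (doc.text().toLowerCase().contains(QUERY.toLowerCase())) {
                System.out.println("Match: " + page.getWebURL().getURL());
            }
        }
    }
}

The crawler class is then registered with a CrawlController seeded with the start URL.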

Query wikipedia

I would like to query two or three terms in order to locate them in Wikipedia's entries. Specifically, I'm trying to see if some terms get repeated in the first paragraphs (the abstract) across entries. This could be done directly or through DBpedia. Thanks
Using the MediaWiki API you can find articles that contain those keywords.
Try the API:Search documentation.
To do what you want, you'd probably need to find the articles that contain those keywords and then parse the text to check whether they appear in the first paragraphs.
With this:
?action=parse&page=Nicolas_Cage&prop=text&section=0
you can get the HTML of the first section of a page (see this post).
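A small sketch of that call from Java, adding format=json so the response is machine-readable; the article title and the two search terms are just examples:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class WikiFirstSection {
    public static void main(String[] args) throws Exception {
        // Fetch the first section (the lead/abstract) of one article
        String api = "https://en.wikipedia.org/w/api.php"
                + "?action=parse&page=Nicolas_Cage&prop=text&section=0&format=json";

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(api).openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }

        // Crude check against the raw JSON/HTML of the lead section; a real
        // implementation would strip the markup first
        String text = body.toString().toLowerCase();
        System.out.println(text.contains("actor") && text.contains("film"));
    }
}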

Solr behind Drupal returns too many results for specific query

We've got Solr sat behind one of our client's Drupal 7 websites, and while it's working well, it returns too many results for what should be quite specific queries. (It also has relevance/weighting problems; but I'm hoping that solving this problem will remove the - literally - irrelevant results.)
For example, searching for the phrase 'particular phrase in london' should return the node with that as its title, quite high up; I don't even think that any other content should be returned. But I find that it's returning lots of content, purely on the fact that it mentions "London"!
Frivolously, searching for the ridiculous phrase 'piecrusts in london' returns a lot of results too, apparently just because they mention London. No content on the site mentions actual piecrusts.
When I search for 'particular phrase in london', here are the parameters that end up in the catalina.out log on the server (whitespace added for clarity):
{spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1
&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR
&spellcheck.q=particular+phrase+in+london
&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0
&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200
&facet.date=ds_created
&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR
&f.bundle.facet.mincount=1&hl.fl=content,ts_comments
&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,
label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,
tos_name,tm_node,zs_entity
&start=0&facet.sort=count&f.bundle.facet.limit=50&q=special+phrase+in+london
&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR
&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0
&facet.field=im_field_health_topic&facet.field=bundle
&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50}
hits=1998 status=0 QTime=14
Note that these parameters have been built by Drupal's Apache Solr module; I don't believe we've got any particular custom code of our own that's doing anything to it.
This corresponds to the following URL, if entered directly in the browser:
http://example.com:8081/solr/CLIENT/select?spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR&spellcheck.q=particular+phrase+in+London&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200&facet.date=ds_created&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR&f.bundle.facet.mincount=1&hl.fl=content,ts_comments&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,tos_name,tm_node,zs_entity&start=0&facet.sort=count&f.bundle.facet.limit=50&q=particular+phrase+in+London&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0&facet.field=im_field_health_topic&facet.field=bundle&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50
This URL returns nearly 2000 results - that's most of the content on the site! I've experimented with removing one query parameter at a time, and the only ones that make any difference seem to be qf and q: if I remove qf, I get zero results; if I remove q, I get more results back!
I guess there are two questions here:
Is there anything in these parameters that tells Solr "don't worry if 'particular phrase', or 'piecrusts', appears: just collate the results for 'london'" and then order by relevancy? I would add that I think 'in' is listed in the stopwords file, so we can probably ignore its effect (?)
Or is this something in the (standard Drupal) schema that I need to change?
I appreciate that sometimes search is better for the visitor if it's inclusive; Google does return results even if it doesn't find perfect matches. But, stopwords and stemming aside, the client does require that searches return only results where all words appear in the content.
As mentioned in the post at http://drupal.org/node/1783454, the Apache Solr Search Integration module makes use of the mm param, which is more or less configured to affect rankings by how closely the keywords match the documents. Looking through the docs, there are other ways you can use the parameter to affect rankings as well. The results produced by Apache Solr Search Integration are therefore weighted more closely to the AND operator, even though more results come back as you add more keywords. The benefit of this param is that in cases where the user enters keywords that are too restrictive, results will still be returned. Displaying no results is a really quick way to drive people away from your site.
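For a sense of what this knob does, requiring every non-stopword term to match corresponds to mm=100% (URL-encoded as mm=100%25) on a dismax/edismax query. Against a plain Solr request, outside the Drupal module's configuration, a hypothetical test might look like:
http://example.com:8081/solr/CLIENT/select?defType=edismax&q=particular+phrase+in+london&qf=content&mm=100%25
Lower percentages or clause counts relax the requirement, which is how the module trades strictness for always showing something.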
How are you displaying the search?
Maybe you could use Solr Views to limit the search scope?
http://drupal.org/project/apachesolr_views
thanks
Nick
