stopwords and synonyms in nutch - nutch

is there is any option to configure stop words and synonyms in nutch crawler
synonyms
gov-->government
some thing similar to this`

It's no very easy but if you are going to be ready of writing your own Indexer plugin I think yes.

Related

How to add a new language Support in Lucene Solr

It seems to me that lucene currently does not support bengali; how can i add support for bengali content?
and what nlp methods(stemming, tokenizing etc.) lucene uses for indexing and search?
my content will be mainly in bengali and with some english words
thanks
edit: it seems there are some info here, but it does not contain enough detail.
http://wiki.apache.org/lucene-java/IndexingOtherLanguages

keyword search in sites

I am planning to build a small social networking site. What is the best way to support keyword search in the content. I am looking for an opinion considering the fact that the contents can grow few TBs in size.
thanks,
GL
You should definitely use Solr/Lucene to index contents resulting in efficient keyword search results and it is also very easy to implement a faceted search based on Solr if you have such a feature in your mind.
Have you looked at Apache Lucene?
It's a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Polish search for Sphinx?

I want to implement a search solution for a website written in Django. From the available options (I have researched Solr, Sphinx, Xapian, PostgreSQL/Tsearch3, MySQL) Sphinx looks like the nicest. However, it does not support stemming for Polish, and that is the language of the data that I want to make searchable.
What are the best ways of dealing with unsupported languages in Sphinx? I have an intuition that I could create a stemming corpus from the Ispell dictionary. How can I make that work with Sphinx?
Search in http://snowball.tartarus.org/ mailist , you might find some info if someone tried to create a polish stemmer . There are 2 free stemmers available , but they are made in java ( I think at least one is made for solr/lucene) . From Ispell , I'm not sure if the stemming corpus can help you , you could create files to be used for wordforms or excepts .

Search term suggestions

This question has been asked in various ways before, but I'm wondering if people who have experience with automatic search term suggestion could offer advice on the most useful and efficient approaches. Here's the scenario:
I'm just starting on a website for a book that is a dictionary of terms (roughly 1,000 entries, with 300 word explanations on average), many of which are fairly obscure, and it is likely that many visitors to the site would not know how to spell the words. The publisher wants to make full-text search available for every entry. So, I'm hoping to implement a search engine with spelling correction. The main site will probably be done in a PHP framework (or possibly Django) with a MySQL database.
Can anyone with experience in this area give advice on the following:
With a set corpus of this nature, should I be using something like Lucene or Sphinx for the search engine?
As far as I can tell, neither of these has a built-in suggestion function. So it seems I will need to integrate one or more of the following. What are the advantages / disadvantages of:
Suggestion requests through Google's search API
A phonetic comparison algorithm like metaphone() in PHP
A spell checking system like Aspell
A simpler spelling script such as Peter Norvig's
A Levenshtein function
I'm concerned about the specificity of my corpus, and don't want Google to start suggesting things that have nothing to do with this book. I'm also not sure whether I should try to use both a metaphone comparison and a Levenshtein comparison, or some other combination of techniques to capture both typos and attempts at phonetic spelling.
You might want to consider Apache Solr, which is a web service encapsulation of Lucene, and runs in a J2EE container like Tomcat. You'll get term suggestion, spell check, porting, stemming and much more. It's really very nice.
See here for a full listing of its features relating to queries.
There are Django and PHP libraries for Solr.
I wouldn't recommend using Google Suggest for such a specialised corpus anyway, and with Solr you won't need it.
Hope this helps.

Natural Language Processing Package

I have started working on a project which requires Natural Language Processing. We have do the spell checking as well as mapping sentences to phrases and their synonyms. I first thought of using GATE but i am confused on what to use? I found an interesting post here which got me even more confused.
http://lordpimpington.com/codespeaks/drupal-5.1/?q=node/5
Please help me decide on what suits my purpose the best. I am working a web application which will us this NLP tool as a service.
You didn't really give much info, but try this: http://www.nltk.org/
I don't think NLTK does spell checking (I could be wrong on this), but it can do parts of speech tagging for text input.
For finding/matching synonyms you could use something like WordNet http://wordnet.princeton.edu/
If you're doing something really domain specific: I would recommend coming up with your own ontology for domain specific terms.
If you are using Python you can develop a spell checker with Python Enchant.
NLTK is good for developing Sentiment Analysis system too. I have some prototypes of the same too
Jaggu
If you are using deep learning based models, and if you have sufficient data, you can implement task specific models for any purpose. With the development of deep leaning based languages models, you can used word embedding based models with lexicon resources to obtain synonyms and antonyms. You can also follow the links below to obtain more resources.
https://stanfordnlp.github.io/CoreNLP/
https://www.nltk.org/
https://wordnet.princeton.edu/

Resources