How to add support for a new language in Lucene/Solr

It seems that Lucene does not currently support Bengali. How can I add support for Bengali content?
Also, which NLP methods (stemming, tokenizing, etc.) does Lucene use for indexing and search?
My content will be mainly in Bengali, with some English words.
Thanks.
Edit: there is some information here, but it does not go into enough detail:
http://wiki.apache.org/lucene-java/IndexingOtherLanguages
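To make the "tokenizing etc." part of the question concrete, here is a minimal sketch of what an analysis chain does conceptually: normalize, tokenize, lowercase, drop stopwords. It is plain Python, not Lucene's API, and the names `tokenize`, `analyze` and `STOPWORDS` are made up for illustration. The one Bengali-relevant detail is that the tokenizer keeps combining marks (Unicode categories starting with M) inside a token, so vowel signs do not split a word. It does not handle zero-width joiners or do any stemming.

```python
import unicodedata

# Illustrative English stopword list (an assumption, not Lucene's list).
STOPWORDS = {"the", "a", "of"}


def tokenize(text):
    """Split text into runs of letters, combining marks and digits."""
    tokens, current = [], []
    for ch in text:
        # Categories L* (letters), M* (marks, e.g. Bengali vowel signs)
        # and N* (numbers) are treated as part of a word.
        if unicodedata.category(ch)[0] in "LMN":
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def analyze(text):
    """Normalize, tokenize, lowercase, and remove stopwords."""
    text = unicodedata.normalize("NFC", text)
    lowered = (token.lower() for token in tokenize(text))
    return [token for token in lowered if token not in STOPWORDS]


if __name__ == "__main__":
    print(analyze("The বাংলা Search, of course"))
```

Lowercasing only affects the English words here, since Bengali script has no case. A real chain for Bengali would add language-specific steps (stopwords, stemming) after this point.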

Related

Well-documented NLP libraries in any language supporting Slavic languages?

Do you have any tips on well-documented, developer-friendly NLP libraries for text analysis (morphology, text concepts) for Slavic languages such as Czech and Polish?
The API could be in any language: Java, Python, C, Node, whatever.
A nice stemming library, as an example, is this one: https://github.com/dundalek/czech-stemmer
I am studying the best options for text analysis. I want to be able to get the most out of a sentence on a specific topic. Let's say I have a medical sentence; using the dictionary words in my database, I want to be able to analyze it with an NLP algorithm.
Thanks!
Try polyglot. It supports both Polish and Czech.

Xapian vs Apache Solr

I'm trying to get a good natural language search going in a website, and trying to understand the advantages of Apache Solr vs Xapian. Xapian seems easier to set up. Do both offer good natural language searches? Any insight appreciated.
Xapian is more like Lucene, a library that you integrate with your application. If you have a C++ app, then Xapian might be a better match. If you have a Java application, Lucene is almost certainly the best choice.
If you want a search server, then compare Omega (built on Xapian) to Solr (built on Lucene). I have not used Omega or Xapian, but Solr has a few features that I have come to depend on, especially the per-field analysis chains. That is a brilliant idea, and one that I wish I had thought of when I was working on Ultraseek.
It is quite easy to extend the Solr analysis chain with your own Java class. I expect that would be more difficult in C++ with Omega/Xapian.
The two engines use different underlying relevance models: Xapian is a probabilistic engine, while Lucene is a vector space engine. I have seen both models tuned to perform well, so that might not be a reason to decide.
The Solr/Lucene community is large and very helpful.
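To illustrate the difference between the two relevance models mentioned above, here is a rough sketch of a single term's weight under each: a classic TF-IDF weight for the vector space side, and BM25 as one well-known probabilistic weighting scheme. Neither function is Lucene's or Xapian's actual scoring code; the function names are invented, and the `k1` and `b` defaults are just commonly quoted values.

```python
import math


def tf_idf_weight(tf, df, n_docs):
    """Vector-space style weight: term frequency times inverse document frequency."""
    return tf * math.log(n_docs / df)


def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 weight for one term in one document.

    tf: occurrences of the term in the document
    df: number of documents containing the term
    doc_len / avg_len: this document's length and the average length
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # The tf component saturates: it approaches (k1 + 1) as tf grows.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm


if __name__ == "__main__":
    for tf in (1, 10, 100):
        print(tf, tf_idf_weight(tf, 5, 100), bm25_weight(tf, 5, 100, 100, 100))
```

The visible difference is that the plain TF-IDF weight keeps growing linearly with term frequency, while the BM25 weight levels off. In practice both families are tuned heavily, which fits the point that the model alone may not decide the choice.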

keyword search in sites

I am planning to build a small social networking site. What is the best way to support keyword search over the content? I am looking for opinions, bearing in mind that the content could grow to a few TB in size.
thanks,
GL
You should definitely use Solr/Lucene to index the content; it gives you efficient keyword search. It is also very easy to implement faceted search on top of Solr, if you have such a feature in mind.
Have you looked at Apache Lucene?
It's a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Polish search for Sphinx?

I want to implement a search solution for a website written in Django. From the available options (I have researched Solr, Sphinx, Xapian, PostgreSQL/Tsearch2, MySQL), Sphinx looks like the nicest. However, it does not support stemming for Polish, and that is the language of the data that I want to make searchable.
What are the best ways of dealing with unsupported languages in Sphinx? I have an intuition that I could create a stemming corpus from the Ispell dictionary. How can I make that work with Sphinx?
Search the mailing list at http://snowball.tartarus.org/; you might find some information there if someone has tried to create a Polish stemmer. There are two free stemmers available, but they are written in Java (I think at least one was made for Solr/Lucene). As for Ispell, I'm not sure a stemming corpus can help you, but you could generate files from it to be used for Sphinx's wordforms or exceptions.
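As a rough sketch of the wordforms idea: Sphinx's wordforms file maps each inflected form to a normal form, one `form > lemma` pair per line. Assuming you have already expanded the Ispell dictionary into a lemma-to-forms mapping (that step is not shown and depends on your Ispell tooling), generating the lines could look like this. The function name and the tiny Polish sample are illustrative only.

```python
def wordforms_lines(lemma_to_forms):
    """Build 'form > lemma' lines for a Sphinx wordforms file.

    lemma_to_forms: dict mapping a lemma to an iterable of its inflected forms.
    Forms identical to the lemma are skipped, since they need no mapping.
    """
    lines = []
    for lemma in sorted(lemma_to_forms):
        for form in sorted(set(lemma_to_forms[lemma])):
            if form != lemma:
                lines.append(f"{form} > {lemma}")
    return lines


# Hypothetical sample: a few inflected forms of the Polish noun "dom" (house).
POLISH_SAMPLE = {"dom": ["dom", "domu", "domem", "domy"]}

if __name__ == "__main__":
    print("\n".join(wordforms_lines(POLISH_SAMPLE)))
```

You would write the resulting lines to a UTF-8 text file and point the index's `wordforms` setting at it. For a full Polish dictionary the file gets large, so check how that affects indexing time.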

Search term suggestions

This question has been asked in various ways before, but I'm wondering if people who have experience with automatic search term suggestion could offer advice on the most useful and efficient approaches. Here's the scenario:
I'm just starting on a website for a book that is a dictionary of terms (roughly 1,000 entries, with 300 word explanations on average), many of which are fairly obscure, and it is likely that many visitors to the site would not know how to spell the words. The publisher wants to make full-text search available for every entry. So, I'm hoping to implement a search engine with spelling correction. The main site will probably be done in a PHP framework (or possibly Django) with a MySQL database.
Can anyone with experience in this area give advice on the following:
With a set corpus of this nature, should I be using something like Lucene or Sphinx for the search engine?
As far as I can tell, neither of these has a built-in suggestion function. So it seems I will need to integrate one or more of the following. What are the advantages / disadvantages of:
Suggestion requests through Google's search API
A phonetic comparison algorithm like metaphone() in PHP
A spell checking system like Aspell
A simpler spelling script such as Peter Norvig's
A Levenshtein function
I'm concerned about the specificity of my corpus, and don't want Google to start suggesting things that have nothing to do with this book. I'm also not sure whether I should try to use both a metaphone comparison and a Levenshtein comparison, or some other combination of techniques to capture both typos and attempts at phonetic spelling.
You might want to consider Apache Solr, which is a web service encapsulation of Lucene, and runs in a J2EE container like Tomcat. You'll get term suggestion, spell check, porting, stemming and much more. It's really very nice.
See here for a full listing of its features relating to queries.
There are Django and PHP libraries for Solr.
I wouldn't recommend using Google Suggest for such a specialised corpus anyway, and with Solr you won't need it.
Hope this helps.
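If you do end up rolling your own suggestions instead, the Levenshtein option from the question is easy to sketch against a fixed vocabulary such as the roughly 1,000 entry headwords. This is a minimal illustration in Python (the same logic ports to PHP); `suggest`, its `max_distance` and `limit` parameters, and the sample vocabulary are all made up for the example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(query, vocabulary, max_distance=2, limit=5):
    """Return vocabulary entries within max_distance edits, closest first."""
    q = query.lower()
    scored = [(levenshtein(q, word.lower()), word) for word in vocabulary]
    close = sorted((d, w) for d, w in scored if d <= max_distance)
    return [word for _, word in close[:limit]]


if __name__ == "__main__":
    vocabulary = ["onomatopoeia", "metonymy", "synecdoche"]
    print(suggest("onomatopea", vocabulary))
```

Comparing against every entry is fine at this corpus size. Edit distance mostly catches typos; for phonetic misspellings you would likely combine it with something like a metaphone key, as the question suggests.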
