Natural Language Processing Package - nlp

I have started working on a project which requires Natural Language Processing. We have to do spell checking as well as mapping sentences to phrases and their synonyms. I first thought of using GATE, but I am confused about what to use. I found an interesting post here which got me even more confused.
http://lordpimpington.com/codespeaks/drupal-5.1/?q=node/5
Please help me decide what suits my purpose best. I am working on a web application which will use this NLP tool as a service.

You didn't really give much info, but try this: http://www.nltk.org/
I don't think NLTK does spell checking (I could be wrong on this), but it can do part-of-speech tagging for text input.
For finding/matching synonyms, you could use something like WordNet: http://wordnet.princeton.edu/
If you're doing something really domain-specific, I would recommend coming up with your own ontology of domain-specific terms.
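For instance, WordNet is exposed through NLTK, so a synonym lookup is only a few lines. A minimal sketch, assuming the WordNet data has been fetched via nltk.download('wordnet'):

    from nltk.corpus import wordnet as wn

    # gather lemma names across every synset of "quick"
    synonyms = {lemma.name() for syn in wn.synsets('quick') for lemma in syn.lemmas()}
    print(synonyms)  # e.g. {'quick', 'speedy', 'agile', 'nimble', ...}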

If you are using Python, you can develop a spell checker with PyEnchant.
NLTK is good for developing a sentiment analysis system too; I have some prototypes of the same.
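A minimal sketch of the PyEnchant approach, assuming the package and an en_US dictionary are installed:

    import enchant

    d = enchant.Dict("en_US")
    word = "speling"
    if not d.check(word):        # check() is True for correctly spelled words
        print(d.suggest(word))   # e.g. ['spelling', 'spieling', ...]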
Jaggu

If you are using deep-learning-based models and have sufficient data, you can implement task-specific models for any purpose. With the development of deep-learning-based language models, you can use word-embedding-based models together with lexicon resources to obtain synonyms and antonyms; a short sketch of that idea follows the links below. You can also follow these links to find more resources.
https://stanfordnlp.github.io/CoreNLP/
https://www.nltk.org/
https://wordnet.princeton.edu/
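To make the embedding idea concrete, here is a minimal sketch using gensim; the file name is a placeholder for any pretrained word2vec/fastText vectors in text format:

    from gensim.models import KeyedVectors

    # 'embeddings.vec' is a placeholder for a downloaded vector file
    kv = KeyedVectors.load_word2vec_format('embeddings.vec')
    print(kv.most_similar('good', topn=5))  # nearest neighbours in embedding space

Note that raw embedding neighbours mix synonyms with antonyms and other related words, which is why the lexicon resources mentioned above are still needed to tell them apart.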

Related

Devanagari text processing (NLP): where to start

I am new to Devanagari NLP. Are there any groups or resources that would help me get started with NLP in a Devanagari-script language (mostly Nepali, or similar languages like Hindi)? I also want to be able to develop fonts for Devanagari and build a font-processing application. If anyone working in this field could give me some advice, it would be highly appreciated.
Thanks in advance
I am new to Devanagari NLP. Are there any groups or resources that would help me get started with NLP in a Devanagari-script language (mostly Nepali, or similar languages like Hindi)?
You can use the embeddings provided by fastText [https://fasttext.cc/docs/en/pretrained-vectors.html#content] and feed them to deep-learning RNN models such as LSTMs for text classification and sentiment analysis; a short sketch follows this list.
You can find some datasets for named entity recognition here [http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5].
For processing Indian languages, you can refer to the Indic NLP Library [https://github.com/anoopkunchukuttan/indic_nlp_library].
NLTK supports Indian languages; for POS tagging and related NLP tasks, you can refer here [http://www.nltk.org/_modules/nltk/corpus/reader/indian.html].
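As a concrete starting point, the pretrained vectors can be queried directly from Python. A minimal sketch, assuming the fasttext package and a downloaded Nepali model (the file name follows fastText's published naming; adjust it to whichever model you fetch):

    import fasttext

    # 'cc.ne.300.bin' is the pretrained Nepali model from the fastText site
    model = fasttext.load_model('cc.ne.300.bin')
    print(model.get_nearest_neighbors('नेपाल'))  # semantically close Nepali words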
Is there any group or resources that would help me get started with NLP in Devnagaric language?
The Bhasa Sanchar project under Madan Puraskar Pustakalaya has developed a Nepali corpus. You may request it for non-commercial purposes from the contact provided in the link above.
Python's NLTK has a Hindi language corpus. You may import it using
from nltk.corpus import indian
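Once the corpus data is fetched with nltk.download('indian'), you can inspect the tagged text directly; a minimal sketch:

    from nltk.corpus import indian

    print(indian.fileids())                       # ['bangla.pos', 'hindi.pos', 'marathi.pos', 'telugu.pos']
    print(indian.tagged_words('hindi.pos')[:10])  # first ten (word, tag) pairs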
For gaining insight into Devanagari-based NLP, I suggest you go through research papers. Nepali is an under-resourced language; much work is yet to be done, and it might be difficult to find content for it.
You should probably look into language detection, text classification, and sentiment analysis, among others (preferably based on a POS-tagging library built from the corpus), to grasp the basics.
For the second part of the question: I am pretty sure font development doesn't come under the domain of Natural Language Processing. Did you mean something else?

Can anyone give a brief overview of how to proceed with Named Entity Recognition in Tamil Language?

I have looked at Stanford NER and Polyglot; neither supports Tamil.
I would like to use ML along with some rule-based NLP processing to do the entity recognition.
Neither Stanford NER nor Polyglot is rule-based. If you are only considering rule-based systems, you should probably look for existing frameworks that handle Tamil correctly, or turn to generic ones (e.g. GATE).
Have a look at this paper, which reports on existing NER systems for Tamil; you may contact the authors.
If you find no system available, it should be rather easy to train one using existing datasets such as NER-FIRE2013 and NER-FIRE2014: ask the organizers how you can obtain access to those datasets.
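If you go that route, a conditional random field is a common ML baseline that also accepts hand-written rule features alongside the learned ones. A minimal sketch with the sklearn-crfsuite package, on toy data standing in for the FIRE datasets:

    import sklearn_crfsuite

    def features(sent, i):
        # toy features; a real Tamil system would add suffixes, gazetteers, rules
        return {'word': sent[i],
                'prev': sent[i - 1] if i > 0 else '<s>'}

    # tiny BIO-tagged toy corpus; replace with NER-FIRE2013/2014 data
    sents = [['John', 'lives', 'in', 'Chennai'], ['Mary', 'visited', 'Madurai']]
    labels = [['B-PER', 'O', 'O', 'B-LOC'], ['B-PER', 'O', 'B-LOC']]

    X = [[features(s, i) for i in range(len(s))] for s in sents]
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
    crf.fit(X, labels)
    print(crf.predict(X))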
Hope this helps!

Accuracy: ANNIE vs Stanford NLP vs OpenNLP with UIMA

My work is planning on using a UIMA cluster to run documents through to extract named entities and whatnot. As I understand it, UIMA has very few NLP components packaged with it. I've been testing GATE for a while now and am fairly comfortable with it. It does OK on normal text, but when we run it through some representative test data, the accuracy drops way down. The text data we have internally is sometimes all caps, sometimes all lowercase, or a mix of the two in the same document. Even using ANNIE's all-caps rules, the accuracy still leaves much to be desired. I've recently heard of Stanford NLP and OpenNLP but haven't had time to extensively train and test them. How do those two compare in terms of accuracy with ANNIE? Do they work with UIMA like GATE does?
Thanks in advance.
It's not possible/reasonable to give a general estimate of the performance of these systems. As you said, the accuracy declines on your test data, and that happens for several reasons: one is the language characteristics of your documents; another is the characteristics of the annotations you are expecting to see. AFAIK, every NER task has similar but still different annotation guidelines.
Having said that, on to your questions:
ANNIE is the only free, open-source, rule-based NER system in Java I could find. It's written for news articles and, I guess, tuned for the MUC-6 task. It's good for proofs of concept, but getting a bit outdated. Its main advantage is that you can start improving it without any knowledge of machine learning or NLP, well, maybe a little Java. Just study JAPE and give it a shot.
OpenNLP, Stanford NLP, etc. come by default with models for news articles and perform better than ANNIE (just looking at results; I never tested them on a big corpus). I liked the Stanford parser better than OpenNLP, again just looking at output on documents, mostly news articles.
Without knowing what your documents look like, I really can't say much more. You should decide whether your data is suitable for rules, or whether you should go the machine-learning way and use OpenNLP, the Stanford parser, the Illinois tagger, or anything else. The Stanford parser seems more appropriate for just pouring in your data, training, and producing results, while OpenNLP seems more appropriate for trying different algorithms, playing with parameters, etc.
For your GATE vs. UIMA dispute, I tried both and found a livelier community and better documentation for GATE. Sorry for giving personal opinions :)
Just for the record answering the UIMA angle: For both Stanford NLP and OpenNLP, there is excellent packaging as UIMA analysis engines available via the DKPro Core project.
I would like to add one more note. UIMA and GATE are two frameworks for the creation of Natural Language Processing (NLP) applications. However, Named Entity Recognition (NER) is a basic NLP component, and you can find implementations of NER that are independent of UIMA and GATE. The good news is that you can usually find a wrapper for a decent NER in both UIMA and GATE. To make it clear, let's look at this example:
OpenNLP NER
A wrapper for OpenNLP NER in GATE
A wrapper for OpenNLP NER in UIMA
It is the same for the Stanford NER component.
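For illustration, the same decoupling is visible from Python: NLTK ships a thin wrapper around the Stanford NER model. A hedged sketch; the two paths are placeholders for wherever you unpack a local Stanford NER download:

    from nltk.tag import StanfordNERTagger

    st = StanfordNERTagger('classifiers/english.all.3class.distsim.crf.ser.gz',
                           'stanford-ner.jar')
    print(st.tag('Rami Eid is studying at Stony Brook University'.split()))
    # [('Rami', 'PERSON'), ('Eid', 'PERSON'), ..., ('University', 'ORGANIZATION')]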
Coming back to your question, this website lists state-of-the-art NERs:
http://www.aclweb.org/aclwiki/index.php?title=Named_Entity_Recognition_(State_of_the_art)
For example, in the MUC-7 competition, the best participant, named LTG, achieved an F1 score of 93.39%.
http://www.aclweb.org/aclwiki/index.php?title=MUC-7_(State_of_the_art)
Note that if you want to use such a state-of-the-art implementation, you may have some issues with its license.

Analysing meaning of sentences

Are there any tools that analyze the meaning of given sentences? Recommendations are greatly appreciated.
Thanks in advance!
I am also looking for similar tools. One thing I found recently was this sentiment analysis tool built by researchers at Stanford.
It provides a model for analyzing the sentiment of a given sentence. It's interesting, and even this seemingly simple idea turns out to be quite involved to model accurately. It uses machine learning to achieve higher accuracy as well. There is a live demo where you can input sentences to analyze.
http://nlp.stanford.edu/sentiment/
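If you want something similar that runs locally, NLTK bundles a rule-based sentiment analyzer (VADER) with a comparable input/output shape; a minimal sketch:

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download('vader_lexicon')  # one-time lexicon download
    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores("The movie was surprisingly good!"))
    # {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}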
I also saw this RelEx semantic dependency relationship extractor.
http://wiki.opencog.org/w/Sentence_algorithms
Some natural language understanding tools can analyze the meaning of sentences, including NLTK and Attempto Controlled English. There are several implementations of discourse representation structures and semantic parsers with a similar purpose.
There are also several parsers that can be used to generate a meaning representation from the text that is being parsed.
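As a taste of what such meaning representations look like, NLTK can parse and manipulate first-order logic formulas of the kind a semantic parser would produce; a minimal sketch:

    from nltk.sem import Expression

    e = Expression.fromstring('exists x.(dog(x) & bark(x))')  # "some dog barks"
    print(e)         # exists x.(dog(x) & bark(x))
    print(e.free())  # set() -- the formula has no free variables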

Natural language processing [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
The question is maybe (about 100%) subjective, but I need advice. What is the best language for natural language processing? I know Java and C++, but is there an easier way to do it? To be more specific, I need to process texts from a lot of sites and extract information.
As I said in the comments, the question is not about a language but about a suitable library, and there are a lot of NLP libraries in both Java and C++. I believe you should inspect some of them (in both languages) and then, once you know the full range of available libraries, create some kind of "big plan" for how to implement your task. So here I'll just give you some links with a brief explanation of what is what.
Java
GATE - it is exactly what its name means - a General Architecture for Text Engineering. An application in GATE is a pipeline: you put language processing resources like tokenizers, POS taggers, morphological analyzers, etc. on it and run the process. The result is represented as a set of annotations - meta-information attached to a piece of text (e.g. a token). In addition to a great number of plugins (including plugins for integration with other NLP resources like WordNet or the Stanford Parser), it has many predefined dictionaries (cities, names, etc.) and its own regex-like language, JAPE. GATE comes with its own IDE (GATE Developer), where you can try your pipeline setup and then save it and load it from Java code.
UIMA - or Unstructured Information Management Architecture. It is very similar to GATE in terms of architecture: it also represents a pipeline and produces a set of annotations. Like GATE, it has a visual IDE where you can try out your future application. The difference is that UIMA is mostly concerned with information extraction, while GATE performs text processing without explicit consideration of its purpose. UIMA also comes with a simple REST server.
OpenNLP - they call themselves an organizational center for open-source projects on NLP, and this is the most appropriate definition. The main direction of development is the use of machine-learning algorithms for the most general NLP tasks like part-of-speech tagging, named entity recognition, coreference resolution, and so on. It also has good integration with UIMA, so its tools are available there as well.
Stanford NLP - probably the best choice for engineers and researchers with NLP and ML knowledge. Unlike libraries such as GATE and UIMA, it doesn't aim to provide as many tools as possible, but instead concentrates on idiomatic models. E.g. you don't get comprehensive dictionaries, but you can train a probabilistic algorithm to create one! In addition to its CoreNLP component, which provides the most widely used tools like tokenization, POS tagging, NER, etc., it has several very interesting subprojects. E.g. its dependency framework allows you to extract the complete sentence structure. That is, you can, for example, easily extract information about the subject and object of a verb in question, which is much harder using other NLP tools; a sketch follows below.
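To illustrate that last point, the subject/object extraction can be scripted against a running CoreNLP server, here through NLTK's wrapper. A hedged sketch; it assumes a CoreNLP server has been started separately on port 9000:

    from nltk.parse.corenlp import CoreNLPDependencyParser

    parser = CoreNLPDependencyParser(url='http://localhost:9000')
    parse, = parser.raw_parse('The cat chased the mouse')
    for governor, relation, dependent in parse.triples():
        if relation in ('nsubj', 'obj', 'dobj'):  # subject/object relations
            print(governor, relation, dependent)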
C++
UIMA - yes, there are complete implementations for both Java and C++.
Stanford Parser - some of Stanford's projects are Java-only, others C++-only, and some are available in both languages. You can find many of them here.
APIs
A number of web service APIs perform specific language processing, including:
Alchemy API - language identification, named entity recognition, sentiment analysis and much more! Take a look at their main page - it is quite self-descriptive.
OpenCalais - this service tries to build a giant graph of everything. You pass it a web page URL, and it enriches that page's text with the entities it found, together with the relations between them. For example, you pass it a page mentioning "Steve Jobs", and it returns "Apple Inc." (roughly speaking) together with the probability that this is the same Steve Jobs.
Other recommendations
And yes, you should definitely take a look at Python's NLTK. It is not only a powerful and easy-to-use NLP library, but also part of an excellent scientific stack created by an extremely friendly community.
Update (2017-11-15): 7 years later there are even more impressive tools, cool algorithms and interesting tasks. One comprehensive description may be found here:
https://tomassetti.me/guide-natural-language-processing/
Python and NLTK
ScalaNLP, which is a Natural Language Processing library written in Scala, seems suitable for your job.
I would recommend Python and NLTK.
Some hints and notes I can pinpoint based on my experience using it:
Python has efficient list and string handling. You can slice lists very efficiently, which matters constantly when working with natural-language data. It also has nice syntactic conveniences; for example, to access the first 100 words of a list, you can write list[:100] (compare that with the STL in C++).
Serialization in Python is easy and native. The pickle module makes saving and loading a processed corpus a one-line task (compare that with the several lines needed with Boost or other C++ libraries).
NLTK provides classes for loading corpora, processing them, tagging, tokenization, grammar parsing, chunking, and a whole set of machine-learning algorithms, among other things. It also provides good resources for probabilistic models based on word distributions in text. http://www.nltk.org/book
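A minimal sketch tying those hints together, assuming the usual NLTK data downloads (the punkt tokenizer and the POS tagger model):

    import pickle
    import nltk

    tokens = nltk.word_tokenize("Python makes corpus handling pleasantly short.")
    tagged = nltk.pos_tag(tokens)        # [('Python', 'NNP'), ('makes', 'VBZ'), ...]
    with open('tagged.pkl', 'wb') as f:  # serialization really is one line each way
        pickle.dump(tagged, f)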
If learning a new programming language is an obstacle, you can check out OpenNLP for Java: http://incubator.apache.org/opennlp/
