Ease of use: Stanford CoreNLP vs. OpenNLP [closed] - nlp

I'm looking to use a suite of NLP tools for a personal project, and I was wondering whether Stanford's CoreNLP or OpenNLP is easier to use. Or is there another free package you would recommend?
I haven't really done any NLP before, so I am looking for something that I can quickly use to learn the concepts and prototype my ideas. Any help is appreciated.

My opinion on which is easier to use is biased, but regarding Ivan Akcheurov's answer, we only released Stanford CoreNLP in Oct 2010, so it isn't very old. Regarding his suggestions, it seems to depend on whether you want a higher-level processing framework or actual processing tools. E.g., if you poke around KNIME, it appears that the only NLP components included are actually OpenNLP ones, and most of the machine learning is wrapping Weka. For groups of individual tools that work together, Stanford NLP, OpenNLP, NLTK, and LingPipe are perhaps the main choices.

I suggest GATE (gate.ac.uk):
GATE
Language: Java
Has UIMA integration support
Documentation: very well documented, with video tutorials and a training course
Has a GUI
Can use WordNet, Lucene, Google, Yahoo, Google Translate, and Weka
Includes parts of LingPipe and OpenNLP as plugins
OpenNLP
Language: Java
SharpNLP (its C# port)
Has UIMA integration support
LingPipe
Language: Java
Documentation: free book tutorials
NLTK
Language: Python
Documentation: an excellent free book
Corpora: provides dozens of corpora (~850 MB) and lexicons such as WordNet

I suggest Stanford, as it provides multiple tools under one package that is also open source. For example, Stanford CoreNLP includes:
Stanford Parser
Stanford POS Tagger
Stanford Named Entity Recognizer
Stanford Typed Dependencies, etc.
So, in short, you get multiple solutions under one umbrella.
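For a sense of how that looks in practice, here is a minimal sketch of a CoreNLP pipeline (assuming a recent CoreNLP release that has the CoreDocument wrapper; the sample sentence is illustrative):
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class PipelineDemo {
    public static void main(String[] args) {
        // Each annotator is one of the bundled tools listed above.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("Stanford is in California.");
        pipeline.annotate(doc);
        // Print the POS tags of the first sentence, e.g. [NNP, VBZ, IN, NNP, .]
        System.out.println(doc.sentences().get(0).posTags());
    }
}
One Properties object configures everything, which is a large part of the "one umbrella" appeal.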

Related

Natural Language Processing books or resource for entry level person? [closed]

Can anyone give some suggestions for a good natural language processing book? These are the factors I have in mind:
It gives a good overview of these huge topics without too much depth.
Concepts are explained with pictures.
Sample code in Java/Python/R.
You can look at online courses about NLP. They often contain videos, exercises, written material, suggested readings, and so on.
I especially like this one: https://www.coursera.org/course/nlp (see the suggested readings section, for instance). You can access the lectures here: https://class.coursera.org/nlp/lecture (PDF + video + subtitles).
I believe there are three options for you--I wrote one of them so take this with a grain of salt.
1) Natural Language Processing with Python
by Steven Bird et al. http://amzn.com/0596516495. This book covers the NLTK API and is considered a solid introduction to NLP. Lots of code, a more academic take on what NLP is, and I assume it is broadly used in undergraduate NLP classes.
2) Natural Language Processing with Java by Richard Reese http://amzn.to/1D0liUY. This covers a range of APIs, including LingPipe below, and introduces NLP concepts and how they are implemented in a range of open source APIs. It is a shallower dive into NLP, but it is a gentler introduction, and it covers how a bunch of APIs solve the same problem, so it may help you pick which API to use.
3) Natural Language Processing with Java and LingPipe Cookbook by Breck Baldwin (me) and Krishna Dayanidhi http://amzn.to/1MvgHxa. This is meant for industrial programmers and covers the concepts common in commercial NLP applications. The book is a much deeper dive into evaluation, problem specification, and varied technologies that on the face of it do the same thing. But it expects you to learn from examples (overwhelmingly Twitter data).
All the books have lots of code, one in Python, the other two in Java. All present mature APIs with a large installed base.
None of the books do much in the way of graphical explanation of what the software is doing.
Good luck

What do I need to know on NLP to be able to use and train Stanford NLP for intent analysis? [closed]

Any book, tutorial, or course recommendations would be much appreciated.
I need to know what level of NLP knowledge is needed to comprehend Stanford NLP and to train and customize it for my app for commercial sentiment analysis.
My goal is not a career in NLP or to become an NLP expert, but only to be proficient enough to understand and use open source NLP frameworks properly and train them for my application.
For this level, what NLP study/training would be needed?
I'm learning C# and .NET as well.
First: to simply use a sentiment model or train on existing data, there is not too much background to learn:
Tokenization
Constituency parsing, parse trees, etc.
Basic machine learning concepts (classification, cost functions, training / development sets, etc.)
These are well-documented ideas and are all a Google away. It might be worth it to skim the Coursera Natural Language Processing course (produced by people here at Stanford!) for the above ideas as well.
After that, the significant task is understanding how the RNTN sentiment model inside CoreNLP works. You don't need to grasp the math fully, I suppose, but the basic recursive nature of the algorithm is important to understand. The best resource is of course the original paper (and there's not much else, to be honest).
To train your own sentiment model, you'll need your own sentiment data. Producing this data is no small task. The data for the Stanford sentiment model was crowdsourced, and you may need to do something similar if you want to collect anything near the same scale.
The RNTN sentiment paper (linked above) gives some details on the data format. I'm happy to expand on this further if you do wish to create your own data.
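If it helps to see the moving parts before training anything custom, here is a minimal sketch of running the bundled sentiment annotator (assuming a recent CoreNLP release; the input sentence is illustrative):
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class SentimentDemo {
    public static void main(String[] args) {
        // The sentiment annotator runs the RNTN over binarized parse trees,
        // so the parse annotator must come before it.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,parse,sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation ann = new Annotation("The film was surprisingly good.");
        pipeline.annotate(ann);
        for (CoreMap sentence : ann.get(CoreAnnotations.SentencesAnnotation.class)) {
            // One of: Very negative, Negative, Neutral, Positive, Very positive.
            System.out.println(sentence.get(SentimentCoreAnnotations.SentimentClass.class));
        }
    }
}
Swapping in your own model later is then roughly a matter of pointing the sentiment.model property at your trained model file.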
I think you simply need to understand the concepts of supervised and unsupervised learning. In addition, some Java knowledge might be useful.

Best Open source / free NLP engine for the job [closed]

Let's say that I have a pool (a list) of well-known phrases, like:
{ "I love you", "Your mother is a ...", "I think I am pregnant" ... } Say about 1,000 of these. Now I want users to enter free text into a text box, and I want some kind of NLP engine to digest the text and find the 10 most relevant phrases from the pool that may be related in some way to the text.
I thought that the simplest implementation could be matching by words: picking one word at a time and looking for similarities in some way. I'm not sure which, though.
What most frightens me is the size of the vocabulary I must support. I am a single developer building some kind of demo, and I don't like the idea of filling words into a table...
I am looking for a free NLP engine. I am agnostic about the language it's written in, but it must be free - NOT some kind of online service that charges per API call.
It seems that TextBlob and ConceptNet are a more than adequate solution to this problem!
TextBlob is an easy-to-use NLP library for Python that is free and open source (licensed under the permissive MIT License). It provides a nice wrapper around the excellent NLTK and pattern libraries.
One simple approach to your problem would be to extract noun phrases from your given text.
Here's an example from the TextBlob docs.
from textblob import TextBlob  # current package path; very old releases used "from text.blob import TextBlob"
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''
blob = TextBlob(text)
print(blob.noun_phrases)
# => ['titular threat', 'blob', 'ultimate movie monster', ...]
This could be a starting point. From there you could experiment with other methods, such as the similarity methods mentioned in the comments or TF-IDF (see the sketch below). TextBlob also makes it easy to swap models for noun phrase extraction.
Full disclosure: I am the author of TextBlob.
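Since the asker is language-agnostic, here is a rough self-contained sketch of the TF-IDF route mentioned above, in plain Java with no NLP library at all (class name and phrase list are illustrative):
import java.util.*;

public class PhraseRanker {
    // Turn a token list into a sparse TF-IDF vector.
    static Map<String, Double> tfidf(List<String> tokens, Map<String, Double> idf) {
        Map<String, Double> vec = new HashMap<>();
        for (String t : tokens) vec.merge(t, 1.0, Double::sum);
        vec.replaceAll((t, tf) -> tf * idf.getOrDefault(t, 0.0));
        return vec;
    }

    // Cosine similarity between two sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        for (double v : a.values()) na += v * v;
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    static List<String> tokenize(String s) {
        return Arrays.asList(s.toLowerCase().split("\\W+"));
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList(
                "I love you", "Your mother is a ...", "I think I am pregnant");
        String userText = "i think i might be pregnant";

        // Smoothed IDF computed over the phrase pool itself.
        Map<String, Double> idf = new HashMap<>();
        for (String p : phrases)
            for (String t : new HashSet<>(tokenize(p))) idf.merge(t, 1.0, Double::sum);
        idf.replaceAll((t, df) -> Math.log((1 + phrases.size()) / (1 + df)) + 1);

        // Rank the pool by similarity to the user text; keep the top 10.
        Map<String, Double> query = tfidf(tokenize(userText), idf);
        phrases.stream()
               .sorted(Comparator.comparingDouble(
                       (String p) -> -cosine(query, tfidf(tokenize(p), idf))))
               .limit(10)
               .forEach(System.out::println);
    }
}
For 1,000 short phrases this brute-force scoring is instant, and the vocabulary builds itself from the pool, so there is no table of words to fill in by hand.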

Accuracy: ANNIE vs Stanford NLP vs OpenNLP with UIMA

My work is planning on using a UIMA cluster to run documents through to extract named entities and whatnot. As I understand it, UIMA has very few NLP components packaged with it. I've been testing GATE for a while now and am fairly comfortable with it. It does OK on normal text, but when we run it through some representative test data, the accuracy drops way down. The text data we have internally is sometimes all caps, sometimes all lowercase, and sometimes a mix of the two in the same document. Even using ANNIE's all-caps rules, the accuracy still leaves much to be desired. I've recently heard of Stanford NLP and OpenNLP but haven't had time to extensively train and test them. How do those two compare in terms of accuracy with ANNIE? Do they work with UIMA like GATE does?
Thanks in advance.
It's not possible/reasonable to give a general estimate of the performance of these systems. As you said, the accuracy declines on your test data. That's for several reasons: one is the language characteristics of your documents; another is the characteristics of the annotations you expect to see. AFAIK, every NER task has similar but still different annotation guidelines.
Having said that, on your questions:
ANNIE is the only free, open source, rule-based NER system in Java I could find. It's written for news articles and, I guess, tuned for the MUC-6 task. It's good for proofs of concept but getting a bit outdated. Its main advantage is that you can start improving it without any knowledge of machine learning or NLP, well, maybe a little Java. Just study JAPE and give it a shot.
OpenNLP, Stanford NLP, etc. come with models for news articles by default and (just looking at results; I never tested them on a big corpus) perform better than ANNIE. I liked the Stanford parser better than OpenNLP, again just looking at output on documents, mostly news articles.
Without knowing what your documents look like, I really can't say much more. You should decide whether your data is suitable for rules, or whether you should go the machine-learning way and use OpenNLP, the Stanford parser, the Illinois tagger, or something else. The Stanford parser seems more appropriate for just pouring in your data, training, and producing results, while OpenNLP seems more appropriate for trying different algorithms, playing with parameters, etc.
On your GATE vs. UIMA dispute: I tried both and found a livelier community and better documentation for GATE. Sorry for giving personal opinions :)
Just for the record answering the UIMA angle: For both Stanford NLP and OpenNLP, there is excellent packaging as UIMA analysis engines available via the DKPro Core project.
I would like to add one more note. UIMA and GATE are two frameworks for the creation of Natural Language Processing (NLP) applications. However, Named Entity Recognition (NER) is a basic NLP component, and you can find implementations of NER that are independent of UIMA and GATE. The good news is that you can usually find a wrapper for a decent NER in both UIMA and GATE. To make this clear, consider this example:
OpenNLP NER
A wrapper for OpenNLP NER in GATE
A wrapper for OpenNLP NER in UIMA
It is the same for the Stanford NER component.
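For illustration, here is a minimal sketch of running OpenNLP's NER on its own, outside GATE or UIMA (assuming the pretrained en-ner-person.bin model has been downloaded from the OpenNLP models page):
import java.io.FileInputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NerDemo {
    public static void main(String[] args) throws Exception {
        // Load a pretrained person-name model; location and organization
        // models work the same way.
        try (FileInputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(modelIn));
            String[] tokens = {"Steve", "Jobs", "founded", "Apple", "."};
            for (Span span : finder.find(tokens)) {
                System.out.println(span);  // prints spans like [0..2) person
            }
        }
    }
}
The GATE and UIMA wrappers linked above do essentially this, plus the plumbing to map the resulting spans into each framework's annotation model.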
Coming back to your question, this website lists the state of the art NERs:
http://www.aclweb.org/aclwiki/index.php?title=Named_Entity_Recognition_(State_of_the_art)
For example, in the MUC-7 competition, the best participant, named LTG, achieved an accuracy of 93.39%.
http://www.aclweb.org/aclwiki/index.php?title=MUC-7_(State_of_the_art)
Note that if you want to use such a state-of-the-art implementation, you may have issues with its license.

Natural language processing [closed]

The question is maybe (about 100%) subjective, but I need advice. What is the best language for natural language processing? I know Java and C++, but is there an easier way to do it? To be more specific, I need to process texts from a lot of sites and extract information.
As I said in the comments, the question is not about a language, but about a suitable library. And there are a lot of NLP libraries in both Java and C++. I believe you should inspect some of them (in both languages) and then, once you know the range of available libraries, create some kind of "big plan" for how to implement your task. So here I'll just give you some links with a brief explanation of what is what.
Java
GATE - it is exactly what its name means: General Architecture for Text Engineering. An application in GATE is a pipeline: you put language-processing resources like tokenizers, POS taggers, morphological analyzers, etc. on it and run the process. The result is represented as a set of annotations: meta-information attached to a piece of text (e.g. a token). In addition to a great number of plugins (including plugins for integration with other NLP resources like WordNet or the Stanford Parser), it has many predefined dictionaries (cities, names, etc.) and its own regex-like language, JAPE. GATE comes with its own IDE (GATE Developer), where you can try your pipeline setup and then save it and load it from Java code.
UIMA - or Unstructured Information Management Architecture. It is very similar to GATE in terms of architecture: it also builds a pipeline and produces a set of annotations. Like GATE, it has a visual IDE where you can try out your future application. The difference is that UIMA mostly concerns information extraction, while GATE performs text processing without explicit consideration of its purpose. UIMA also comes with a simple REST server.
OpenNLP - they call themselves an organizational center for open source projects on NLP, and this is the most appropriate definition. The main direction of development is using machine-learning algorithms for the most general NLP tasks, like part-of-speech tagging, named entity recognition, coreference resolution, and so on. It also has good integration with UIMA, so its tools are available there as well.
Stanford NLP - probably the best choice for engineers and researchers with NLP and ML knowledge. Unlike libraries such as GATE and UIMA, it doesn't aim to provide as many tools as possible, but instead concentrates on a smaller set of polished models. E.g., you don't get comprehensive dictionaries, but you can train a probabilistic algorithm to create one! In addition to its CoreNLP component, which provides the most widely used tools like tokenization, POS tagging, NER, etc., it has several very interesting subprojects. E.g., its Stanford Dependencies framework allows you to extract the complete structure of a sentence. That is, you can, for example, easily extract information about the subject and object of a verb, which is much harder using other NLP tools (see the sketch just below).
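As a small taste of that last point, here is a minimal sketch of pulling out a dependency graph with CoreNLP (assuming a recent release with the CoreDocument API; note that older versions label the object relation dobj rather than obj):
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import java.util.Properties;

public class DepDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("The cat chased the mouse.");
        pipeline.annotate(doc);

        // The graph contains labeled relations such as
        // nsubj(chased, cat) and obj(chased, mouse).
        SemanticGraph graph = doc.sentences().get(0).dependencyParse();
        System.out.println(graph);
    }
}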
C++
UIMA - yes, there are complete implementations for both Java and C++.
Stanford Parser - some of Stanford's projects are Java-only, others C++-only, and some are available in both languages. You can find many of them here.
APIs
A number of web service APIs perform specific language processing, including:
Alchemy API - language identification, named entity recognition, sentiment analysis and much more! Take a look at their main page - it is quite self-descriptive.
OpenCalais - this service tries to build a giant graph of everything. You pass it a web page URL, and it enriches that page's text with the entities it found, together with the relations between them. For example, you pass it a page mentioning "Steve Jobs" and it returns "Apple Inc." (roughly speaking), together with the probability that this is the same Steve Jobs.
Other recommendations
And yes, you should definitely take a look at Python's NLTK. It is not only a powerful and easy-to-use NLP library, but also part of an excellent scientific stack created by an extremely friendly community.
Update (2017-11-15): 7 years later there are even more impressive tools, cool algorithms and interesting tasks. One comprehensive description may be found here:
https://tomassetti.me/guide-natural-language-processing/
Python and NLTK
ScalaNLP, which is a Natural Language Processing library written in Scala, seems suitable for your job.
I would recommend Python and NLTK.
Some hints and notes I can pinpoint based on my experience using it:
Python has efficient list and string handling. You can index lists very efficiently, which matters a lot when working with natural language. It also has nice syntactic conveniences; for example, to access the first 100 words of a list, you can slice it as list[:100] (compare that with the STL in C++).
Python serialization is easy and native. The serialization modules (e.g. pickle) make saving and loading corpora and processed text a one-line task (compare that with the several lines needed with Boost or other C++ libraries).
NLTK provides classes for loading corpora, processing them, tagging, tokenization, grammar parsing, chunking, and a whole set of machine learning algorithms, among other things. It also provides good resources for probabilistic models based on word distributions in text. http://www.nltk.org/book
If learning a new programming language is an obstacle, you can check out OpenNLP for Java: http://incubator.apache.org/opennlp/
