Natural Language Processing Libraries - nlp

I'm having a hard time figuring out what library and datasets go together.
Toolkits / Libraries I've found:
CoreNLP - Java
NLTK - Python
OpenNLP - Java
ClearNLP - Java
Out of all of these, some are missing features. For example OpenNLP didn't have a dependency parsing.
I need to find a library that's quick that will also do dependency parsing and part of speech tagging.
The next hurdle is where do we get data sets. I've found a lot of things out there, but nothing full and comprehensive.
Data I've found:
NLTK Corpora
English Web Treebank (looks to be the best but is paid)
OpenNLP
Penn Treebank
I'm confused as to what data sets I need for what features and what's actually available publicly. From my research is seems ClearNLP will work best for but has very little data.
Thank you

Stanford CoreNLP provides both POS tagging and dependency parsing out of the box (plus many other features!), it already has trained models so you don't need any data sets for it work!
Please let me know if you have any more questions about the toolkit!
http://nlp.stanford.edu/software/corenlp.shtml

Related

How to Detect Present Simple Tense in English Sentences using a Rule-Based Approach?

I have a straightforward task of determining the sentence structure, specifically to identify if a sentence written in plain English is in the "present simple" tense. I am aware of a couple of libraries that could help with this task:
OpenNLP
CoreNLP
However, it seems that both of these libraries use machine learning in the background and require pre-trained language models. I am looking for a more lightweight solution, possibly using a rule-based approach. Is it possible to use OpenNLP or CoreNLP without machine learning for my task?

What's the difference between Stanford Tagger, Parser and CoreNLP?

I'm currently using different tools from Stanford NLP Group and trying to understand the differences between them. It seems to me that somehow they intersect each other, since I can use same features in different tools (e.g. tokenize, and POS-Tag a sentence can be done by Stanford POS-Tagger, Parser and CoreNLP).
I'd like to know what's the actual difference between each tool and in which situations I should use each of them.
All Java classes from the same release are the same, and, yes, they overlap. On a code basis, the parser and tagger are basically subsets of what is available in CoreNLP, except that they do have a couple of little add-ons of their own, such as the GUI for the parser. In terms of provided models, the parser and tagger come with models for a range of languages, whereas CoreNLP ships only with English out of the box. However, you can then download language-particular jars for CoreNLP which provide all the models we have for different languages. Anything that is available in any of the releases is present in the CoreNLP github site: https://github.com/stanfordnlp/CoreNLP

Accuracy: ANNIE vs Stanford NLP vs OpenNLP with UIMA

My work is planning on using a UIMA cluster to run documents through to extract named entities and what not. As I understand it, UIMA have very few NLP components packaged with it. I've been testing GATE for awhile now and am fairly comfortable with it. It does ok on normal text, but when we run it through some representative test data, the accuracy drops way down. The text data we have internally is sometimes all caps, sometimes all lowercase, or a mix of the two in the same document. Even using ANNIE's all caps rules, the accuracy still leaves much to be desired. I've recently heard of Stanford NLP and OpenNLP but haven't had time to extensively train and test them. How do those two compare in terms of accuracy with ANNIE? Do they work with UIMA like GATE does?
Thanks in advance.
It's not possible/reasonable to give a general estimate on performance of these systems. As you said, on your test data the accuracy declines. That's for several reasons, one is the language characteristics of your documents, another is characteristics of the annotations you are expecting to see. Afaik for every NER task there are similar but still different annotation guidelines.
Having that said, on your questions:
ANNIE is the only free open source rule-based NER system in Java I could find. It's written for news articles and I guess tuned for the MUC 6 task. It's good for proof of concepts, but getting a bit outdated. Main advantage is that you can start improving it without any knowledge in machine learning, nlp, well maybe a little java. Just study JAPE and give it a shot.
OpenNLP, Stanford NLP, etc. come by default with models for news articles and perform (just looking at results, never tested them on a big corpus) better than ANNIE. I liked the Stanford parser better than OpenNLP, again just looking at documents, mostly news articles.
Without knowing what your documents look like I really can't say much more. You should decide if your data is suitable for rules or you go the machine learning way and use OpenNLP or Stanford parser or Illinois tagger or anything. The Stanford parser seems more appropriate for just pouring your data, training and producing results, while OpenNLP seems more appropriate for trying different algorithms, playing with parameters, etc.
For your GATE over UIMA dispute, I tried both and found more viral community and better documentation for GATE. Sorry for giving personal opinions :)
Just for the record answering the UIMA angle: For both Stanford NLP and OpenNLP, there is excellent packaging as UIMA analysis engines available via the DKPro Core project.
I would like to add one more note. UIMA and GATE are two frameworks for the creation of Natural Language Processing(NLP) applications. However, Name Entity Recognition (NER) is a basic NLP component and you can find an implementation of NER, independent of UIMA and GATE. The good news is you can usually find a wrapper for a decent NER in the UIMA and GATE. To make it clear let see this example:
OpenNLP NER
A wrapper for OpenNLP NER in GATE
A wrapper for OpenNLP NER in UIMA
It is the same for the Stanford NER component.
Coming back to your question, this website lists the state of the art NERs:
http://www.aclweb.org/aclwiki/index.php?title=Named_Entity_Recognition_(State_of_the_art)
For example, in the MUC-7 competition, best participant named LTG got the result with the accuracy of 93.39%.
http://www.aclweb.org/aclwiki/index.php?title=MUC-7_(State_of_the_art)
Note that if you want to use such a state of are implementation, you may have some issue with their license.

details on the following Natural Language Processing terms?

Named Entity Extraction (extract ppl, cities, organizations)
Content Tagging (extract topic tags by scanning doc)
Structured Data Extraction
Topic Categorization (taxonomy classification by scanning doc....bayesian )
Text extraction (HTML page cleaning)
are there libraries that i can use to do any of the above functions of NLP ?
dont really feel like forking out cash to AlchemyAPI
There are actually plenty of freely available open-source natural language processing packages out there. Here's a brief list, organized by what language the toolkit is implemented in:
Python: Natural Language Toolkit NLTK
Java: OpenNLP, Gate, and Stanford's JavaNLP
.NET: Sharp NLP
If you're uncertain which one to go with, I would recommend starting with NLTK. The package is reasonably easy to use and has great documentation online, including a free book.
You should be able to use NLTK to easily accomplish the NLP tasks you've listed, e.g. named entity recognition (NER), extracting tags for documents, and document categorization.
What the Alchemy people call structured data extraction looks like it's just HTML scrapping that is robust against changes to the underlying HTML as long as the page still visually renders the same way. So, it's not really a NLP task.
For the extraction of text from HTML, just use boilerpipe. It's fast, good, and free.
The Apache UIMA project was originally created by IBM and provides an NLP framework much like GATE. There are various annotators out there that are built for UIMA.

Natural Language Processing Package

I have started working on a project which requires Natural Language Processing. We have do the spell checking as well as mapping sentences to phrases and their synonyms. I first thought of using GATE but i am confused on what to use? I found an interesting post here which got me even more confused.
http://lordpimpington.com/codespeaks/drupal-5.1/?q=node/5
Please help me decide on what suits my purpose the best. I am working a web application which will us this NLP tool as a service.
You didn't really give much info, but try this: http://www.nltk.org/
I don't think NLTK does spell checking (I could be wrong on this), but it can do parts of speech tagging for text input.
For finding/matching synonyms you could use something like WordNet http://wordnet.princeton.edu/
If you're doing something really domain specific: I would recommend coming up with your own ontology for domain specific terms.
If you are using Python you can develop a spell checker with Python Enchant.
NLTK is good for developing Sentiment Analysis system too. I have some prototypes of the same too
Jaggu
If you are using deep learning based models, and if you have sufficient data, you can implement task specific models for any purpose. With the development of deep leaning based languages models, you can used word embedding based models with lexicon resources to obtain synonyms and antonyms. You can also follow the links below to obtain more resources.
https://stanfordnlp.github.io/CoreNLP/
https://www.nltk.org/
https://wordnet.princeton.edu/

Resources