details on the following Natural Language Processing terms? - nlp

Named Entity Extraction (extract ppl, cities, organizations)
Content Tagging (extract topic tags by scanning doc)
Structured Data Extraction
Topic Categorization (taxonomy classification by scanning doc....bayesian )
Text extraction (HTML page cleaning)
are there libraries that i can use to do any of the above functions of NLP ?
dont really feel like forking out cash to AlchemyAPI

There are actually plenty of freely available open-source natural language processing packages out there. Here's a brief list, organized by what language the toolkit is implemented in:
Python: Natural Language Toolkit NLTK
Java: OpenNLP, Gate, and Stanford's JavaNLP
.NET: Sharp NLP
If you're uncertain which one to go with, I would recommend starting with NLTK. The package is reasonably easy to use and has great documentation online, including a free book.
You should be able to use NLTK to easily accomplish the NLP tasks you've listed, e.g. named entity recognition (NER), extracting tags for documents, and document categorization.
What the Alchemy people call structured data extraction looks like it's just HTML scrapping that is robust against changes to the underlying HTML as long as the page still visually renders the same way. So, it's not really a NLP task.
For the extraction of text from HTML, just use boilerpipe. It's fast, good, and free.

The Apache UIMA project was originally created by IBM and provides an NLP framework much like GATE. There are various annotators out there that are built for UIMA.

Related

Natural Language Processing Libraries

I'm having a hard time figuring out what library and datasets go together.
Toolkits / Libraries I've found:
CoreNLP - Java
NLTK - Python
OpenNLP - Java
ClearNLP - Java
Out of all of these, some are missing features. For example OpenNLP didn't have a dependency parsing.
I need to find a library that's quick that will also do dependency parsing and part of speech tagging.
The next hurdle is where do we get data sets. I've found a lot of things out there, but nothing full and comprehensive.
Data I've found:
NLTK Corpora
English Web Treebank (looks to be the best but is paid)
OpenNLP
Penn Treebank
I'm confused as to what data sets I need for what features and what's actually available publicly. From my research is seems ClearNLP will work best for but has very little data.
Thank you
Stanford CoreNLP provides both POS tagging and dependency parsing out of the box (plus many other features!), it already has trained models so you don't need any data sets for it work!
Please let me know if you have any more questions about the toolkit!
http://nlp.stanford.edu/software/corenlp.shtml

Accuracy: ANNIE vs Stanford NLP vs OpenNLP with UIMA

My work is planning on using a UIMA cluster to run documents through to extract named entities and what not. As I understand it, UIMA have very few NLP components packaged with it. I've been testing GATE for awhile now and am fairly comfortable with it. It does ok on normal text, but when we run it through some representative test data, the accuracy drops way down. The text data we have internally is sometimes all caps, sometimes all lowercase, or a mix of the two in the same document. Even using ANNIE's all caps rules, the accuracy still leaves much to be desired. I've recently heard of Stanford NLP and OpenNLP but haven't had time to extensively train and test them. How do those two compare in terms of accuracy with ANNIE? Do they work with UIMA like GATE does?
Thanks in advance.
It's not possible/reasonable to give a general estimate on performance of these systems. As you said, on your test data the accuracy declines. That's for several reasons, one is the language characteristics of your documents, another is characteristics of the annotations you are expecting to see. Afaik for every NER task there are similar but still different annotation guidelines.
Having that said, on your questions:
ANNIE is the only free open source rule-based NER system in Java I could find. It's written for news articles and I guess tuned for the MUC 6 task. It's good for proof of concepts, but getting a bit outdated. Main advantage is that you can start improving it without any knowledge in machine learning, nlp, well maybe a little java. Just study JAPE and give it a shot.
OpenNLP, Stanford NLP, etc. come by default with models for news articles and perform (just looking at results, never tested them on a big corpus) better than ANNIE. I liked the Stanford parser better than OpenNLP, again just looking at documents, mostly news articles.
Without knowing what your documents look like I really can't say much more. You should decide if your data is suitable for rules or you go the machine learning way and use OpenNLP or Stanford parser or Illinois tagger or anything. The Stanford parser seems more appropriate for just pouring your data, training and producing results, while OpenNLP seems more appropriate for trying different algorithms, playing with parameters, etc.
For your GATE over UIMA dispute, I tried both and found more viral community and better documentation for GATE. Sorry for giving personal opinions :)
Just for the record answering the UIMA angle: For both Stanford NLP and OpenNLP, there is excellent packaging as UIMA analysis engines available via the DKPro Core project.
I would like to add one more note. UIMA and GATE are two frameworks for the creation of Natural Language Processing(NLP) applications. However, Name Entity Recognition (NER) is a basic NLP component and you can find an implementation of NER, independent of UIMA and GATE. The good news is you can usually find a wrapper for a decent NER in the UIMA and GATE. To make it clear let see this example:
OpenNLP NER
A wrapper for OpenNLP NER in GATE
A wrapper for OpenNLP NER in UIMA
It is the same for the Stanford NER component.
Coming back to your question, this website lists the state of the art NERs:
http://www.aclweb.org/aclwiki/index.php?title=Named_Entity_Recognition_(State_of_the_art)
For example, in the MUC-7 competition, best participant named LTG got the result with the accuracy of 93.39%.
http://www.aclweb.org/aclwiki/index.php?title=MUC-7_(State_of_the_art)
Note that if you want to use such a state of are implementation, you may have some issue with their license.

Natural language processing [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
Question is maybe ( about 100%) subjective but I need advices. What is best language for natural language processing ? I know Java and C++ but is there easier way to do it. To be more specific I need to process texts from lot of sites and get information.
As I said in comments, the question is not about a language, but about suitable library. And there are a lot of NLP libraries in both Java and C++. I believe you must inspect some of them (in both languages) and then, when you will know all the plenty of available libraries, create some kind of "big plan", how to implement your task. So, here I'll just give you some links with a brief explanation what is what.
Java
GATE - it is exactly what its name means - General Architecture for Text Processing. Application in GATE is a pipeline. You put language processing resources like tokenizers, POS-taggers, morphological analyzers, etc. on it and run the process. The result is represented as a set of annotations - meta information, attached to a peace of text (e.g. token). In addition to great number of plugins (including plugins for integration with other NLP resources like WordNet or Stanford Parser), it has many predefined dictionaries (cities, names, etc.) and its own regex-like language JAPE. GATE comes with its own IDE (GATE Developer), where you can try your pipeline setup, and then save it and load from Java code.
UIMA - or Unstructured Information Management Applications. It is very similar to GATE in terms of architecture. It also represents pipeline and produces set of annotations. Like GATE, it has visual IDE, where you can try out your future application. The difference is that UIMA mostly concerns information extraction while GATE performs text processing without explicit consideration of its purpose. Also UIMA comes with simple REST server.
OpenNLP - they call themselves organization center for open source projects on NLP, and this is the most appropriate definition. Main direction of development is to use machine learning algorithms for the most general NLP tasks like part-of-speech tagging, named entity recognition, coreference resolution and so on. It also has good integration with UIMA, so its tools are also available.
Stanford NLP - probably best choice for engineers and researchers with NLP and ML knowledge. Unlike libraries like GATE and UIMA, it doesn't aim to provide as much tools as possible, but instead concentrates on idiomatic models. E.g. you don't have comprehensive dictionaries, but you can train probabilistic algorithm to create it! In addition to its CoreNLP component, that provides most wildly used tools like tokenization, POS tagging, NER, etc., it has several very interesting subprojects. E.g. their Dependency framework allows you to extract complete sentence structure. That is, you can, for example, easily extract information about subject and object of a verb in question, which is much harder using other NLP tools.
C++
UIMA - yes, there are complete implementations for both Java and C++.
Stanford Parser - some Stanford's projects are only in Java, others - only in C++, and some of them are available in both languages. You can find many of them here.
APIs
A number of web service APIs perform specific language processing, including:
Alchemy API - language identification, named entity recognition, sentiment analysis and much more! Take a look at their main page - it is quite self-descriptive.
OpenCalais - this service tries to build giant graph of everything. You pass it a web page URL and it enriches this page text with found entities, together with relations between them. For example, you pass it a page with "Steve Jobs" and it returns "Apple Inc." (roughly speaking) together with probability that this is the same Steve Jobs.
Other recommendations
And yes, you should definitely take a look at Python's NLTK. It is not only a powerful and easy-to-use NLP library, but also a part of excellent scientific stack created by extremely friendly community.
Update (2017-11-15): 7 years later there are even more impressive tools, cool algorithms and interesting tasks. One comprehensive description may be found here:
https://tomassetti.me/guide-natural-language-processing/
Python and NLTK
ScalaNLP, which is a Natural Language Processing library written in Scala, seems suitable for your job.
I would recommend Python and NLTK.
Some hints and notes I can pinpoint based on my experience using it:
Python has efficient list, strings handling. You can index lists very efficiently what in natural language should be a fact. Also has nice syntactic delicacies, for example to access the first 100 words of a list, you can index as list[:100] (compare it with stl in c++).
Python serialization is easy and native. The serialization modules make language processing corpus and text handling an easy task, one line of code.(compare it with the several lines using Boost or other libraries of C++)
NLTK provides classes for loading corpus, processing it, tagging, tokenization, grammars parsing, chunking, and a whole set of machine learning algorithms, among other stuff. Also it provides good resources for probabilistic models based on words distribution in text. http://www.nltk.org/book
If learning a new programming language is an obstacle, you can check openNLP for Java http://incubator.apache.org/opennlp/

Natural Language Processing Package

I have started working on a project which requires Natural Language Processing. We have do the spell checking as well as mapping sentences to phrases and their synonyms. I first thought of using GATE but i am confused on what to use? I found an interesting post here which got me even more confused.
http://lordpimpington.com/codespeaks/drupal-5.1/?q=node/5
Please help me decide on what suits my purpose the best. I am working a web application which will us this NLP tool as a service.
You didn't really give much info, but try this: http://www.nltk.org/
I don't think NLTK does spell checking (I could be wrong on this), but it can do parts of speech tagging for text input.
For finding/matching synonyms you could use something like WordNet http://wordnet.princeton.edu/
If you're doing something really domain specific: I would recommend coming up with your own ontology for domain specific terms.
If you are using Python you can develop a spell checker with Python Enchant.
NLTK is good for developing Sentiment Analysis system too. I have some prototypes of the same too
Jaggu
If you are using deep learning based models, and if you have sufficient data, you can implement task specific models for any purpose. With the development of deep leaning based languages models, you can used word embedding based models with lexicon resources to obtain synonyms and antonyms. You can also follow the links below to obtain more resources.
https://stanfordnlp.github.io/CoreNLP/
https://www.nltk.org/
https://wordnet.princeton.edu/

text mining library or lingual library?

i have a bunch of data harvested from a forum I own, and would like to do some text mining or use some linguistic library to extract useful information.
any text mining, data mining library in any language will do.
Thank you.
I recommend that you have a look at R. It has an extensive number of text mining packages: have a look at the Natural Language Processing view. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Computing: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example of an analysis of the R-devel
mailing list (https://stat.ethz.ch/pipermail/r-devel/) newsgroup postings from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Another example of useful package for this is Gary King's readme package.
You may like to have a look at the Python NLTK (Natural Language ToolKit): it's specifically designed for this kind of thing.
There is also a great book you can but to get you started.
Mallet is a java library designed for text mining. Once you have preprocessed the text data, a general data mining tool like Weka would also suffice your task.
If you have access to SPSS or SAS, their products should be more easier to use.
Try GATE, it has GUI and of course you can use java api for more power:
http://gate.ac.uk/family/developer.html
You can also use Weka for processing text and doing text mining, have a look at these useful lectures:
http://sentimentmining.net/weka/
stanford core-nlp is good for English text, and has things like Named Entity Recognition. Take a look at: http://nlp.stanford.edu/software/corenlp.shtml
GATE, which Ehsan already recommended, is also good, but it can be a bit complicated if you need to write your own components. For large-scale stuff it's great though.
UIMA is similar to GATE, but not as easy to use because it doesn't feature an extensive GUI like GATE. (http://uima.apache.org)
I would recommend the following Python libraries:
nltk
keras
tensorflow
Note: Before any text analysis you should clean the data based on your requirement

Resources