Part-of-speech tagging in OpenNLP vs. StanfordNLP

I'm new to part-of-speech (POS) tagging and I'm doing POS tagging on a text document. I'm considering using either OpenNLP or StanfordNLP for this. For StanfordNLP I'm using a MaxentTagger with english-left3words-distsim.tagger as the model. In OpenNLP I'm using a POSModel loaded from en-pos-maxent.bin. How are these two taggers (MaxentTagger and POSTagger) and the models (english-left3words-distsim.tagger and en-pos-maxent.bin) different, and which one usually gives better results?

Both POS taggers are based on maximum-entropy machine learning. They differ in the parameters/features used to determine POS tags. For example, the StanfordNLP POS tagger uses: "(i) more extensive treatment of capitalization for unknown words; (ii) features for the disambiguation of the tense forms of verbs; (iii) features for disambiguating particles from prepositions and adverbs" (read more in the paper). The features used by OpenNLP are documented elsewhere; I don't know them offhand.
The models are probably trained on different corpora.
In general, it is really hard to tell which NLP tool performs better in terms of quality. This depends strongly on your domain, so you need to test the tools yourself. See the following papers for more information:
Is Part-of-Speech Tagging a Solved Task?
Large Dataset for Keyphrases Extraction
In order to address this problem practically, I'm developing a Maven plugin and an annotation tool to create domain-specific NLP models more effectively.
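If you want a quick empirical comparison on your own text without writing Java, here is a rough sketch: NLTK ships a thin wrapper around the Stanford tagger, and OpenNLP's pretrained model can be exercised through its command-line tool. The file paths below are assumptions pointing at local downloads of the two toolkits.

from nltk.tag.stanford import StanfordPOSTagger

# Paths are assumptions -- point them at your local Stanford tagger download.
stanford = StanfordPOSTagger(
    'english-left3words-distsim.tagger',
    path_to_jar='stanford-postagger.jar',
)
print(stanford.tag('What is the airspeed of an unladen swallow ?'.split()))

# The OpenNLP equivalent is easiest from its command-line tool:
#   bin/opennlp POSTagger en-pos-maxent.bin < sentences.txt

Tag the same sentences with both and compare the outputs against a small hand-annotated sample from your own domain.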

Related

Named Entity Recognition Systems for German Texts

I am working on a Named Entity Recognition (NER) project in which I have a large amount of text, in the sense that it is too much to read or even skim. Therefore, I want to create an overview of what is mentioned by extracting named entities (places, names, times, maybe topics) and creating an index of the form (entity, list of pages/lines where it is mentioned). I have worked through Stanford's NLP lecture and (parts of) Eisenstein's Introduction to NLP book, and found some literature and systems for English texts. As my corpus is in German, I would like to ask how I can approach this problem. Also, this is my first NLP project, so I don't know whether I could solve this challenge even if the texts were in English.
As a first step: are there German NER systems out there which I could use?
The further roadmap of my project is:
How can I avoid mapping misspellings or rare names to a NUL/UNK token? This is relevant because there are also some historic passages that use words no longer in use or that follow old orthography. I think the relevant terms are tokenisation or stemming.
I thought about fine-tuning or applying transfer learning to a base NER model on a corpus of historic texts to improve NER.
A major challenge is that there is no annotated dataset for my corpus available and I could only manually annotate a tiny fraction of it. So I would be happy for hints on German annotated datasets which I could incorporate into my project.
Thank you in advance for your inputs and fruitful discussions.
Most good NLP toolkits can perform NER in German:
Stanford NLP
spaCy
probably NLTK and OpenNLP as well
What is crucial to understand is that using NER software like the above means using a pretrained model, i.e. a model which has been previously trained on some standard corpus with standard annotated entities.
By the way, you can usually find the original annotated dataset by looking at the documentation. There's one NER corpus here.
This is convenient and might suit your goal, but sometimes it doesn't capture exactly everything that you would like it to, especially if your corpus is from a very specific domain. If you need more specific NER, you must train your own model, and this requires obtaining some annotated data (i.e. manually annotating it or paying somebody to do it).
Even in this case, a NER model is statistical, so it will unavoidably make some mistakes; don't expect perfect results.
About misspellings or rare names: a NER model doesn't care (or not much) about the actual entity, because it's not primarily based on the words in the entity. It's based on indications in the surrounding text. For example, in the sentence "It was announced by Mr XYZ that the event would take place in July", the NER model should find 'Mr XYZ' as a person due to "announced by", and 'July' as a date because of "take place in". However, if the language used in your corpus is very different from the training data used for the model, the performance could be very bad.
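As a minimal sketch of the pretrained-model route in Python with spaCy, assuming the small German model de_core_news_sm is installed (the sample lines and the index structure are illustrative, mirroring the (entity, list of lines) index from the question):

from collections import defaultdict

import spacy

# assumes the model is installed: python -m spacy download de_core_news_sm
nlp = spacy.load('de_core_news_sm')

lines = [
    "Angela Merkel besuchte im Juli das Brandenburger Tor.",
    "Das Tor steht in Berlin.",
]

# build an index of (entity text, label) -> line numbers where it occurs
index = defaultdict(list)
for line_no, line in enumerate(lines):
    for ent in nlp(line).ents:
        index[(ent.text, ent.label_)].append(line_no)

for (text, label), line_nos in index.items():
    print(label, text, '->', line_nos)

The German spaCy models use PER, LOC, ORG and MISC as entity labels; how well they handle historic orthography is exactly the kind of thing you would need to test on your own corpus.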

How to use dependency parsing features for text classification?

I did dependency parsing for a sentence using spaCy and obtained syntactic dependency tags.
import spacy

nlp = spacy.load('en')  # 'en' is the shorthand model name used by older spaCy versions
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
for token in doc:
    # print each token with its POS tag, its dependency label, and its head
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))
Output
Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--compound-- currencies/NNS
currencies/NNS <--pobj-- on/IN
I'm unable to understand how I can use this information to generate dependency-based features for text classification. What are the possible ways to generate such features?
Thanks in advance.
In spaCy, there is currently no direct way to include the dependency features into the textcat component, unless you hack your way through the internals of the code.
In general, you'll have to think about what kind of features would be beneficial to give clues to your textcat algorithm. You could generate a feature for every possible "dependency path" in your data, such as "RB --advmod-- VBD", and count how many times each occurs, but you'll very quickly end up with a very sparse dataset.
You may also be interested in other features, like "what POS is the ROOT word?" or "does the sentence include two nouns connected by a verb?". But it really depends on the application.
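As a rough sketch of the "bag of dependency paths" idea outside of spaCy's textcat, assuming a scikit-learn classifier downstream (the model name and example texts are placeholders):

from collections import Counter

import spacy
from sklearn.feature_extraction import DictVectorizer

# 'en_core_web_sm' is an assumption -- any spaCy model with a parser works
nlp = spacy.load('en_core_web_sm')

def dependency_path_features(text):
    # count each "child-tag --dep-- head-tag" arc as one bag-of-paths feature
    doc = nlp(text)
    return Counter(
        "{0} --{1}-- {2}".format(token.tag_, token.dep_, token.head.tag_)
        for token in doc
    )

texts = [
    "Wall Street Journal just published an interesting piece on crypto currencies",
    "The quick brown fox jumps over the lazy dog",
]

# DictVectorizer turns the sparse count dicts into a matrix that any
# scikit-learn classifier can consume
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(dependency_path_features(t) for t in texts)
print(X.shape)  # (2, number of distinct dependency paths seen)

Each document becomes a sparse vector of dependency-arc counts; from there, any standard classifier applies, though as noted above the feature space grows very quickly.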

What are some good tools/practises for aspect level sentiment analysis?

I am planning to get some review data from TripAdvisor, and I want to be able to extract hotel-related aspects, assign polarity to them, and classify them as negative or positive.
What tools can I use for this purpose, and how and where do I start? I know there are some tools like GATE, Stanford NLP, OpenNLP etc., but would I be able to perform the above specific tasks with them? If so, please let me know an approach to go forward. I am planning to use Java as the programming language and would like to use some APIs.
Also, should I go with a rule-based approach, an ML approach that uses a trained corpus of reviews, or some other approach completely?
P.S.: I am new to NLP and I need some help to go forward.
Stanford CoreNLP has a lot of features in one package:
POS Tagger
NER Model
Sentiment Analysis
Parser
The Apache OpenNLP package, on the other hand, consists of:
Sentence Detector
POS Tagger
NER
Chunker
But OpenNLP doesn't have a built-in feature to determine sentiment polarity, so you have to pass your tags to another resource, such as SentiWordNet, to find the polarity.
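A minimal sketch of that hand-off in Python using NLTK's SentiWordNet interface, assuming the listed NLTK resources have been downloaded; the nearest-adjective heuristic is deliberately naive and only illustrative:

import nltk
from nltk.corpus import sentiwordnet as swn

# one-time downloads (run once):
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
#   nltk.download('wordnet'); nltk.download('sentiwordnet')

def adjective_polarity(word):
    # average (positive - negative) score over the word's adjective synsets
    synsets = list(swn.senti_synsets(word, 'a'))
    if not synsets:
        return 0.0
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

sentence = "The room was spacious but the bathroom was dirty."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# naive aspect extraction: treat each noun as an aspect and score the
# nearest adjective in the sentence
nouns = [(i, w) for i, (w, t) in enumerate(tagged) if t.startswith('NN')]
adjectives = [(i, w) for i, (w, t) in enumerate(tagged) if t.startswith('JJ')]
for noun_pos, noun in nouns:
    if not adjectives:
        break
    _, adjective = min(adjectives, key=lambda a: abs(a[0] - noun_pos))
    score = adjective_polarity(adjective)
    label = 'positive' if score > 0 else 'negative' if score < 0 else 'neutral'
    print("{0}: {1} ({2})".format(noun, label, adjective))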
I have used both OpenNLP and Stanford CoreNLP, but for either one you need to adapt the sentiment corpus to your domain.
You can try ConceptNet (http://conceptnet5.media.mit.edu/). See, for instance, the bottom of https://github.com/commonsense/conceptnet5/wiki/API for how to "see 20 things in English with the most positive affect".

NLTK NER: Continuous Learning

I have been trying to use the NER feature of NLTK. I want to extract entities (such as dates and times) from articles. I know that it cannot be perfect in doing so, but I wonder: if there is human intervention in between to manually tag NEs, will it improve?
If yes, is it possible with the present model in NLTK to continually train the model (semi-supervised training)?
The plain vanilla NER chunker provided in NLTK internally uses a maximum entropy chunker trained on the ACE corpus. Hence it is not able to identify dates or times, unless you train it with your own classifier and data (which is quite a meticulous job).
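For reference, this is what the vanilla chunker looks like in use (the resource names are NLTK's standard downloads; the sentence is illustrative):

import nltk

# one-time downloads (run once):
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
#   nltk.download('maxent_ne_chunker'); nltk.download('words')

sentence = "Mark visited the Eiffel Tower in Paris last summer."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# print every named-entity chunk the ACE-trained model found
for subtree in tree.subtrees(lambda t: t.label() != 'S'):
    print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))

# 'last summer' will typically not be chunked at all, which is the
# date/time limitation described above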
You could refer to this link for performing the same.
Also, there is a module called timex in nltk_contrib which might help you with your needs.
If you are interested in doing the same in Java, look into Stanford SUTime; it is part of Stanford CoreNLP.

Entity Recognition and Sentiment Analysis using NLP

So, this question might be a little naive, but I thought asking the friendly people of Stack Overflow wouldn't hurt.
My current company has been using a third-party API for NLP for a while now. We basically URL-encode a string and send it over, and they extract certain entities for us (we have a list of entities that we're looking for) and return a JSON mapping of entity : sentiment. We've recently decided to bring this project in house instead.
I've been studying NLTK, Stanford NLP and LingPipe for the past 2 days now, and can't figure out if I'm basically reinventing the wheel with this project.
We already have massive tables containing the original unstructured text and another table containing the extracted entities from that text and their sentiment. The entities are single words. For example:
Unstructured text : Now for the bed. It wasn't the best.
Entity : Bed
Sentiment : Negative
I believe that implies we have training data (unstructured text) as well as entities and sentiments. Now, how can I go about using this training data with one of the NLP frameworks to get what we want? No clue. I've sort of got the steps, but I'm not sure:
Tokenize sentences
Tokenize words
Find the noun in the sentence (POS tagging)
Find the sentiment of that sentence.
But wouldn't that fail for the case I mentioned above, since it talks about the bed in two different sentences?
So the question: does anyone know what the best framework would be for accomplishing the above tasks, and any tutorials on the same? (Note: I'm not asking for a solution.) If you've done this stuff before, is this task too large to take on? I've looked up some commercial APIs but they're absurdly expensive to use (we're a tiny startup).
Thanks stackoverflow!
OpenNLP may also be a library to look at. At least they have a small tutorial on training the name finder and on using the document categorizer to do sentiment analysis. To train the name finder you have to prepare training data by tagging the entities in your text with SGML tags.
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
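The training format marks entities inline with SGML-style tags, one sentence per line; the Pierre Vinken sentence below is the example used in the OpenNLP manual linked above:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .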
NLTK provides a naive NER tagger along with resources, but it doesn't fit all cases (including finding dates). However, NLTK allows you to modify and customize the NER tagger according to your requirements; this link might give you some ideas with basic examples on how to customize it. Also, if you are comfortable with Scala and functional programming, this is one tool you cannot afford to miss.
Cheers!
I have discovered spaCy lately and it's just great! In the link you can find a comparison of its performance, in terms of speed and accuracy, against NLTK and CoreNLP, and it does really well!
That said, solving your problem is not really a matter of framework. You can have two different systems, one for NER and one for sentiment, and they can be completely independent. The trend these days is to use neural networks, and if you are willing, you can train a recurrent neural network (which has shown strong performance on NLP tasks) with an attention mechanism to find the entity and the sentiment.
There are great demos everywhere on the internet; the last two I read and found interesting are [1] and [2].
Similar to spaCy, TextBlob is another fast and easy package that can accomplish many of these tasks.
I use NLTK, spaCy, and TextBlob frequently. If the corpus is simple, generic, and straightforward, spaCy and TextBlob work well out of the box. If the corpus is highly customized, domain-specific, or messy (incorrect spelling or grammar), I'll use NLTK and spend more time customizing my NLP text-processing pipeline with scrubbing, lemmatizing, etc. A minimal sketch combining spaCy and TextBlob follows the quickstart links below.
NLTK Tutorial: http://www.nltk.org/book/
Spacy Quickstart: https://spacy.io/usage/
Textblob Quickstart: http://textblob.readthedocs.io/en/dev/quickstart.html
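As promised, a minimal sketch of the pipeline with spaCy and TextBlob, assuming the small English model and TextBlob's corpora are installed; note that it scores each noun phrase by the polarity of its own sentence, so it still misses the cross-sentence 'bed' case from the question (that needs coreference resolution):

import spacy
from textblob import TextBlob

# assumes: python -m spacy download en_core_web_sm
# assumes: python -m textblob.download_corpora
nlp = spacy.load('en_core_web_sm')

doc = nlp("Now for the bed. It wasn't the best.")

# score each candidate aspect (noun chunk) by the polarity of the
# sentence it occurs in
for chunk in doc.noun_chunks:
    polarity = TextBlob(chunk.sent.text).sentiment.polarity
    label = ('positive' if polarity > 0
             else 'negative' if polarity < 0 else 'neutral')
    print(chunk.text, '->', label)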
