Coreference Resolution using OpenNLP

I want to do coreference resolution using OpenNLP. The Apache documentation (Coreference Resolution) doesn't cover how to actually do it. Does anybody have any docs or a tutorial on how to do this?

I recently ran into the same problem and wrote up some blog notes for using OpenNLP 1.5.x tools. It's a bit dense to copy in its entirety, so here's a link with more details.
At a high level, you need to load the appropriate OpenNLP coreference model libraries and also the WordNet 3.0 dictionary. Given those dependencies, initializing the linker object is pretty straightforward:
// LinkerMode should be TEST
//Note: I tried LinkerMode.EVAL before realizing that this was the problem
Linker _linker = new DefaultLinker("lib/opennlp/coref", LinkerMode.TEST);
Using the Linker, however, is a bit less obvious. You need to:
Break the content down into sentences and the corresponding tokens
Create a Parse object for each sentence
Wrap each sentence Parse so as to indicate the sentence ordering:
final DefaultParse parseWrapper = new DefaultParse(parse, idx);
Iterate over each sentence parse and use the Linker to get the Mention objects from each parse:
final Mention[] extents =
_linker.getMentionFinder().getMentions(parseWrapper);
Finally, use the Linker to identify the distinct entities across all of the Mention objects:
DiscourseEntity[] entities = _linker.getEntities(arrayOfAllMentions);
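Putting those steps together, here is a minimal sketch. It assumes the OpenNLP 1.5.x coreference models live under lib/opennlp/coref, that the WordNet 3.0 dictionary is set up as described in the linked notes, and that you have already created an OpenNLP Parser and split the text into sentence strings (the paths and names here are placeholders, not anything prescribed by OpenNLP):

import java.util.ArrayList;
import java.util.List;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.coref.DefaultLinker;
import opennlp.tools.coref.DiscourseEntity;
import opennlp.tools.coref.Linker;
import opennlp.tools.coref.LinkerMode;
import opennlp.tools.coref.mention.DefaultParse;
import opennlp.tools.coref.mention.Mention;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;

public class CorefSketch {

    public static DiscourseEntity[] resolve(Parser parser, String[] sentences)
            throws Exception {
        // Initialize the linker in TEST mode (EVAL will not work here).
        // WordNet 3.0 must also be available to the linker; see the notes above.
        Linker linker = new DefaultLinker("lib/opennlp/coref", LinkerMode.TEST);

        List<Mention> allMentions = new ArrayList<Mention>();
        for (int idx = 0; idx < sentences.length; idx++) {
            // Parse the sentence, keeping only the top parse.
            Parse parse = ParserTool.parseLine(sentences[idx], parser, 1)[0];

            // Wrap the parse with its sentence index to preserve ordering.
            DefaultParse parseWrapper = new DefaultParse(parse, idx);

            // Collect the mentions found in this sentence.
            for (Mention mention : linker.getMentionFinder().getMentions(parseWrapper)) {
                allMentions.add(mention);
            }
        }

        // Resolve all mentions into distinct discourse entities.
        return linker.getEntities(allMentions.toArray(new Mention[allMentions.size()]));
    }
}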

There is little coreference resolution documentation for OpenNLP at the moment except for a very short mention of how to run it in the readme.
If you're not invested in using OpenNLP, then consider the Stanford CoreNLP package, which includes a Java example of how to run it, including how to perform coreference resolution. It also includes a page summarizing its performance and the papers published on the coreference package.
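For comparison, here is a minimal sketch of running CoreNLP's coreference (dcoref) annotator from Java, assuming the CoreNLP jar and its English models jar are on the classpath (the sample sentence and class name are just placeholders):

import java.util.Map;
import java.util.Properties;
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CoreNlpCorefSketch {
    public static void main(String[] args) {
        // dcoref needs the pipeline up to and including parsing.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("John drove to Judy's house. He made her dinner.");
        pipeline.annotate(doc);

        // Each CorefChain groups the mentions that refer to the same entity.
        Map<Integer, CorefChain> chains =
            doc.get(CorefCoreAnnotations.CorefChainAnnotation.class);
        for (CorefChain chain : chains.values()) {
            System.out.println(chain);
        }
    }
}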

Related

Natural Language Processing Libraries

I'm having a hard time figuring out which libraries and data sets go together.
Toolkits / Libraries I've found:
CoreNLP - Java
NLTK - Python
OpenNLP - Java
ClearNLP - Java
Out of all of these, some are missing features. For example, OpenNLP doesn't have dependency parsing.
I need to find a library that's fast and that does both dependency parsing and part-of-speech tagging.
The next hurdle is where to get data sets. I've found a lot of things out there, but nothing complete and comprehensive.
Data I've found:
NLTK Corpora
English Web Treebank (looks to be the best but is paid)
OpenNLP
Penn Treebank
I'm confused as to which data sets I need for which features and what's actually available publicly. From my research it seems ClearNLP will work best, but it has very little data.
Thank you
Stanford CoreNLP provides both POS tagging and dependency parsing out of the box (plus many other features!). It already comes with trained models, so you don't need any data sets for it to work!
Please let me know if you have any more questions about the toolkit!
http://nlp.stanford.edu/software/corenlp.shtml
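As an illustration, here is a short sketch of POS tagging and dependency parsing with the bundled English models, assuming the CoreNLP jar and its models jar are on the classpath (the sentence and class name are arbitrary):

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class TagAndParse {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("The quick brown fox jumps over the lazy dog.");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Part-of-speech tag for each token.
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.word() + "\t"
                    + token.get(CoreAnnotations.PartOfSpeechAnnotation.class));
            }
            // Dependency parse of the sentence.
            SemanticGraph deps =
                sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
            System.out.println(deps);
        }
    }
}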

Library to use for event extraction from text?

My project is to extract events automatically from a given text. The events can be written specifically, or just mentioned in a sentence, or not be there at all.
What is the library or technique that I should use for this? I tried the Stanford NER demo, but it gave bad results. I have enough time to explore and learn a library, so complexity is not a problem. Accuracy is a priority.
The task itself is complex and still not solved as of 2018. But recently a very useful Python NLP library has emerged: spaCy. It has relatively high performance compared to its competitors, and with it you can do things like tokenization, POS tagging, NER, and sentence similarity. However, you still need to build on these features and extract events according to your own specific rules.
Deep learning may also be worth trying, but even though it works well on other NLP tasks, it is still immature for event extraction.

What's the difference between Stanford Tagger, Parser and CoreNLP?

I'm currently using different tools from the Stanford NLP Group and trying to understand the differences between them. It seems to me that they overlap, since the same features appear in several tools (e.g. tokenizing and POS-tagging a sentence can be done with the Stanford POS Tagger, the Parser, and CoreNLP).
I'd like to know what's the actual difference between each tool and in which situations I should use each of them.
The Java classes in all of these tools from the same release are the same, and, yes, they overlap. On a code basis, the parser and tagger are basically subsets of what is available in CoreNLP, except that they do have a couple of little add-ons of their own, such as the GUI for the parser. In terms of provided models, the parser and tagger come with models for a range of languages, whereas CoreNLP ships only with English out of the box. However, you can then download language-particular jars for CoreNLP that provide all the models we have for different languages. Anything that is available in any of the releases is present on the CoreNLP GitHub site: https://github.com/stanfordnlp/CoreNLP
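To illustrate the language-particular jars, something like the sketch below should work once, for example, the French models jar (the models-french classifier of the stanford-corenlp artifact) is on the classpath; the properties file named here is the one bundled inside that jar, and the sentence is just an example:

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class FrenchPipelineSketch {
    public static void main(String[] args) {
        // Reads annotator settings and model paths from the properties file
        // shipped inside the French models jar.
        StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-french.properties");
        Annotation doc = new Annotation("Le chat mange la souris.");
        pipeline.annotate(doc);
        System.out.println(doc);
    }
}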

How can one resolve synonyms in named-entity recognition?

In natural language processing, named-entity recognition is the challenge of, well, recognizing named entities such as organizations, places, and most importantly names.
There is a major challenge in this, though, which I call the problem of synonymy: The Count and Dracula in fact refer to the same person, but it is possible that this is never stated directly in the text.
What would be the best algorithm to resolve these synonyms?
If there is a feature for this in any Python-based library, I'm eager to be educated. I'm using NLTK.
You are describing a problem of coreference resolution and named entity linking. I'm providing separate links as I am not entirely sure which one you meant.
Coreference: Stanford CoreNLP currently has one of the best implementations, but it is in Java. I have used the Python bindings and wasn't too happy with them; I ended up running all my data through the Stanford pipeline just once and then loading the processed XML files in Python. Obviously, that doesn't work if you have to process in real time.
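For reference, that batch approach (run the pipeline once over your files, then read the XML from whatever language you like) is roughly the following command from the unpacked CoreNLP distribution directory; the input file name is a placeholder:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml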
Named entity linking: Check out Apache Stanbol and the links in the following Stack Overflow post.

Smalltalk and tf-idf algorithm

Can anyone show a simple implementation or usage example of a tf-idf algorithm in Smalltalk for natural language processing?
I've found an implementation in a package called NaturalSmalltalk, but it seems too complicated for my needs. A simple implementation in Python is like this one.
I've noticed there is another tf-idf implementation in Hapax, but it seems related to the analysis of software system vocabularies, and I didn't find examples of how to use it.
I am the author of the original Hapax package for VisualWorks. Hapax is a general-purpose information retrieval package; it should be able to work with any kind of text files. It just happens that I used to use it to analyze source code files.
The class you are looking for is TermDocumentMatrix; there should be two methods, globalWeighting: and localWeighting:, to which you pass instances of InverseDocumentFrequency and either LogTermFrequency or TermFrequency, depending on your needs. Typically, when people refer to tf-idf they mean it to include logarithmic term frequencies.
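Hapax's exact smoothing may differ, but the standard log-tf times idf weighting that combination corresponds to is, for a term t in document d, with N documents in the corpus and df(t) of them containing t:

tfidf(t, d) = (1 + log tf(t, d)) * log(N / df(t))

Choosing TermFrequency instead of LogTermFrequency drops the logarithm on the first factor.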
There should be tests demonstrating the TermDocumentMatrix class on a small example corpus. If the tests have not been ported to Squeak, please let me know so I can provide you with an example.
TextLint is a system based on PetitParser for parsing and matching patterns in natural language. It doesn't provide what you ask for, but it shouldn't be too difficult to extend the model to compute word frequencies.
