Relationship Extraction using Stanford CoreNLP

I'm trying to extract information from natural language content using the Stanford CoreNLP library.
My goal is to extract "subject-action-object" pairs (simplified) from sentences.
As an example, consider the following text:
John Smith only eats an apple and a banana for lunch. He's on a diet and his mother told him that it would be very healthy to eat less for lunch. John doesn't like it at all but since he's very serious with his diet, he doesn't want to stop.
From this text I would like to get results like the following:
John Smith - eats - only an apple and a banana for lunch
He - is - on a diet
His mother - told - him - that it would be very healthy to eat less for lunch
John - doesn't like - it (at all)
He - is - very serious with his diet
How would one do this?
Or to be more specific:
How can I parse a dependency tree (or a better-suited tree?) to obtain results as specified above?
Any hint, resource, or code snippet for this task would be highly appreciated.
Side note:
I managed to replace coreferences with their representative mention, which would change the "he" and "his" to the corresponding entity ("John Smith" in this case).

The Stanford CoreNLP toolkit comes with a dependency parser.
First of all, here is a link where the types of edges in the tree are described:
http://universaldependencies.github.io/docs/
There are numerous ways you can use the toolkit to generate the dependency tree.
Here is some sample code to get you started:
import java.io.*;
import java.util.*;

import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;

public class DependencyTreeExample {

    public static void main(String[] args) throws IOException {
        // set up properties
        Properties props = new Properties();
        props.setProperty("ssplit.eolonly", "true");
        props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
        // set up pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // read the contents of the input file (one sentence per line)
        String content = new Scanner(new File(args[0])).useDelimiter("\\Z").next();
        System.out.println(content);
        // annotate the text
        Annotation annotation = new Annotation(content);
        pipeline.annotate(annotation);
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            System.out.println("---");
            System.out.println("sentence: " + sentence);
            SemanticGraph tree = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(tree.toString(SemanticGraph.OutputFormat.READABLE));
        }
    }
}
Instructions:
1) Cut and paste this into DependencyTreeExample.java.
2) Put that file in the directory stanford-corenlp-full-2015-04-20.
3) Compile it: javac -cp "*:." DependencyTreeExample.java
4) Add your sentences, one per line, to a file called dependency_sentences.txt.
5) Run it: java -cp "*:." DependencyTreeExample dependency_sentences.txt
An example of the output:
sentence: John doesn't like it at all.

dep      reln     gov
---      ----     ---
like-4   root     root
John-1   nsubj    like-4
does-2   aux      like-4
n't-3    neg      like-4
it-5     dobj     like-4
at-6     case     all-7
all-7    nmod:at  like-4
.-8      punct    like-4
This will print out the dependency parses. By working with the SemanticGraph object you can write code to find the kinds of patterns you want.
You'll note that in this example "like" points to "John" with "nsubj" and to "it" with "dobj".
For reference you should look at edu.stanford.nlp.semgraph.SemanticGraph
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/SemanticGraph.html
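As a starting point, here is a minimal sketch (my own, not part of the toolkit) of how you could walk the SemanticGraph and pull out rough (subject, verb, object) triples from the nsubj and dobj edges. It could go inside the sentence loop above; IndexedWord and SemanticGraphEdge are already covered by the wildcard imports in the example:

        for (IndexedWord gov : tree.vertexSet()) {
            IndexedWord subj = null;
            IndexedWord obj = null;
            // look at all edges going out of this word
            for (SemanticGraphEdge edge : tree.outgoingEdgeIterable(gov)) {
                String rel = edge.getRelation().getShortName();
                if (rel.equals("nsubj")) {
                    subj = edge.getDependent();
                } else if (rel.equals("dobj")) {
                    obj = edge.getDependent();
                }
            }
            if (subj != null && obj != null) {
                // prints e.g. "John - like - it" for the sentence above
                System.out.println(subj.word() + " - " + gov.word() + " - " + obj.word());
            }
        }

Note this only yields the bare triple; to recover "doesn't like" you would also collect the aux and neg dependents of the governor (does-2 and n't-3 in the parse above) and prepend them.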

You could also try out the new Stanford OpenIE system: http://nlp.stanford.edu/software/openie.shtml. In addition to the standalone download, it's now bundled in CoreNLP 3.6.0+.
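For example, here is a minimal sketch based on the OpenIE documentation (the input sentence is just the one from the question):

import java.util.*;
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.CoreMap;

public class OpenIEExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        // openie needs the natlog annotator upstream
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, depparse, natlog, openie");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation("John Smith only eats an apple and a banana for lunch.");
        pipeline.annotate(doc);
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // each RelationTriple is a (subject, relation, object) extraction
            for (RelationTriple triple : sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class)) {
                System.out.println(triple.subjectGloss() + " - " + triple.relationGloss() + " - " + triple.objectGloss());
            }
        }
    }
}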

Related

Replace specific text with a redacted version using Python

I am looking to do the opposite of what has been done here:
import re

text = '1234-5678-9101-1213 1415-1617-1819-hello'
output = re.sub(r"(\d{4}-){3}(?=\d{4})", "XXXX-XXXX-XXXX-", text)
# output = 'XXXX-XXXX-XXXX-1213 1415-1617-1819-hello'
Partial replacement with re.sub()
My overall goal is to replace all XXXX within a text using a neural network. XXXX can represent names, places, numbers, dates, etc. that are in a .csv file.
The end result would look like:
XXXX went to XXXX XXXXXX
Sponge Bob went to Disney World.
In short, I am unmasking text and replacing the masks with values from a generated dataset using fuzzy matching.
You can do it using named-entity recognition (NER). It's fairly simple, and there are off-the-shelf tools out there to do it, such as spaCy.
NER is an NLP task where a neural network (or other method) is trained to detect certain entities, such as names, places, dates and organizations.
Example:
Sponge Bob went to South beach, he payed a ticket of $200!
I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.
It returns the detected entities in each sentence (for example "Sponge Bob" tagged as a person and "$200" as money). Just be aware that this is not 100% accurate!
Here is a little snippet for you to try out:
import spacy

phrases = ['Sponge Bob went to South beach, he payed a ticket of $200!',
           'I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.']

nlp = spacy.load('en')

for phrase in phrases:
    doc = nlp(phrase)
    replaced = ""
    for token in doc:
        if token.ent_type_:  # the token is part of a named entity
            replaced += "XXXX "
        else:
            replaced += token.text + " "
    print(replaced)
Read more here: https://spacy.io/usage/linguistic-features#named-entities
You could, instead of replacing with XXXX, replace based on the entity type, like:

if token.ent_type_ == "PERSON":
    replaced += "<PERSON> "

Then:

import random

personames = ["Jack", "Mike", "Bob", "Dylan"]
phrase = phrase.replace("<PERSON>", random.choice(personames))

How do I get the correct NER using SpaCy from text like "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired"?

How do I get the correct NER using SpaCy from text like "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site."
here "Criticized Trump" is recognized as person instead of "Trump" as person.
How to pre-process and lower case the text like "Criticized" or "Texts" from the above string to overcome above issue or any other technique to do so.
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
from pprint import pprint
sent = ("F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site")
doc = nlp(sent)
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])
The result from the above code:
"Criticized Trump" as 'PERSON' and "Texts" as 'GPE'
The expected result:
"Trump" as 'PERSON' instead of "Criticized Trump", and "Texts" with no entity label instead of 'GPE'
You can add more examples of Named Entities to tune the NER model. Here you have all the information needed to prepare training data: https://spacy.io/usage/training. You can use Prodigy (an annotation tool from the spaCy creators, https://prodi.gy) to mark Named Entities in your data.
Indeed, you can pre-process with POS tagging in order to lower-case words like "Criticized" or "Texts" that are not proper nouns.
Proper capitalization (lower vs. upper case) will help the NER tagger.
sent = "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site"
doc = nlp(sent)
words = []
spaces = []
for a in doc:
if a.pos_ != 'PROPN':
words.append( a.text.lower() )
else:
words.append(a.text)
spaces.append(a.whitespace_)
spaces = [len(sp) for sp in spaces]
docNew = Doc(nlp.vocab, words=words, spaces=spaces)
print(docNew)
# F.B.I. Agent Peter Strzok, who criticized Trump in texts, is fired - the New York Times SectionsSEARCHSkip to contentskip to site

Sentiment-ranked nodes in dependency parse with Stanford CoreNLP?

I'd like to perform a dependency parse on a group of sentences and look at the sentiment ratings of individual nodes, as in the Stanford Sentiment Treebank (http://nlp.stanford.edu/sentiment/treebank.html).
I'm new to the CoreNLP API, and after fiddling around I still have no idea how I'd go about getting a dependency parse with ranked nodes. Is this even possible with CoreNLP, and if so, does anyone have experience doing it?
I modified the code of the included StanfordCoreNLPDemo.java file to suit our sentiment needs:
Imports:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations.PredictedClass;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;
Initializing the pipeline. Properties include lemma and sentiment:
public class StanfordCoreNlpDemo {

    public static void main(String[] args) throws IOException {
        PrintWriter out;
        if (args.length > 1) {
            out = new PrintWriter(args[1]);
        } else {
            out = new PrintWriter(System.out);
        }
        PrintWriter xmlOut = null;
        if (args.length > 2) {
            xmlOut = new PrintWriter(args[2]);
        }

        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, parse, sentiment");
        props.setProperty("tokenize.options", "normalizeCurrency=false");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Adding the text. These 3 sentences are taken from the live demo of the site you linked. I print the top-level annotation's keys as well, to see what you can access from it:
        Annotation annotation;
        if (args.length > 0) {
            annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
        } else {
            annotation = new Annotation("This movie doesn't care about cleverness, wit or any other kind of intelligent humor. Those who find ugly meanings in beautiful things are corrupt without being charming. There are slow and repetitive parts, but it has just enough spice to keep it interesting.");
        }
        pipeline.annotate(annotation);
        pipeline.prettyPrint(annotation, out);
        if (xmlOut != null) {
            pipeline.xmlPrint(annotation, xmlOut);
        }

        // An Annotation is a Map and you can get and use the various analyses individually.
        // For instance, this gets the parse tree of the first sentence in the text.
        out.println();
        // The toString() method on an Annotation just prints the text of the Annotation
        // But you can see what is in it with other methods like toShorterString()
        out.println("The top level annotation's keys: ");
        out.println(annotation.keySet());
For the first sentence, I print its keys and sentiment. Then I iterate through all of its nodes. For each one, I print the leaves of that subtree (the part of the sentence this node refers to), the name of the node, its sentiment, its node vector (I don't know what that is) and its predictions.
Sentiment is an integer ranging from 0 to 4: 0 is very negative, 1 negative, 2 neutral, 3 positive and 4 very positive. Predictions is a vector of five values, each one holding a percentage for how likely it is for that node to belong to the corresponding class. The first value is for the very negative class, etc. The highest percentage is the node's sentiment.
Not all nodes of the annotated tree have sentiment. It seems that each word of the sentence has two nodes in the tree: you would expect words to be leaves, but they have a single child, a node whose label lacks the prediction annotation in its keys. That node's name is the same word.
That is why I check for the prediction annotation before calling the function that fetches it. The alternative would be to catch the null pointer exception thrown, but I chose to be explicit, so the reader of this answer understands that no sentiment information is actually missing.
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        if (sentences != null && sentences.size() > 0) {
            ArrayCoreMap sentence = (ArrayCoreMap) sentences.get(0);
            out.println("Sentence's keys: ");
            out.println(sentence.keySet());
            Tree tree2 = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
            out.println("Sentiment class name:");
            out.println(sentence.get(SentimentCoreAnnotations.ClassName.class));
            Iterator<Tree> it = tree2.iterator();
            while (it.hasNext()) {
                Tree t = it.next();
                out.println(t.yield());
                out.println("nodestring:");
                out.println(t.nodeString());
                if (((CoreLabel) t.label()).containsKey(PredictedClass.class)) {
                    out.println("Predicted Class: " + RNNCoreAnnotations.getPredictedClass(t));
                }
                out.println(RNNCoreAnnotations.getNodeVector(t));
                out.println(RNNCoreAnnotations.getPredictions(t));
            }
Lastly, some more output. The dependencies are printed; they could also be accessed through accessors of the parse tree (tree or tree2):
out.println("The first sentence is:");
Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
out.println();
out.println("The first sentence tokens are:");
for (CoreMap token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
ArrayCoreMap aToken = (ArrayCoreMap) token;
out.println(aToken.keySet());
out.println(token.get(CoreAnnotations.LemmaAnnotation.class));
}
out.println("The first sentence parse tree is:");
tree.pennPrint(out);
tree2.pennPrint(out);
out.println("The first sentence basic dependencies are:");
out.println(sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class).toString(SemanticGraph.OutputFormat.LIST));
out.println("The first sentence collapsed, CC-processed dependencies are:");
SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
}
}
}

Training n-gram NER with Stanford NLP

Recently I have been trying to train n-gram entities with Stanford Core NLP. I have followed this tutorial: http://nlp.stanford.edu/software/crf-faq.shtml#b
With this, I am able to specify only unigram tokens and the class each belongs to. Can anyone guide me through extending it to n-grams? I am trying to extract known entities like movie names from a chat data set.
Please guide me in case I have misinterpreted the Stanford tutorials and the same can be used for n-gram training.
What I am stuck with is the following property
#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1
Here the first column is the word (unigram) and the second column is the entity, for example
CHAPTER O
I O
Emma PERS
Woodhouse PERS
Training known entities (say movie names) like Hulk or Titanic would be easy with this approach. But if I need to train multi-word names like I know what you did last summer or Baby's day out, what is the best approach?
It was a long wait here for an answer. I was not able to figure out a way to get it done using Stanford Core NLP; however, mission accomplished: I used the LingPipe NLP libraries instead. I'm quoting the answer here because I think someone else could benefit from it.
Please check out the LingPipe licensing before diving into an implementation, whether you are a developer, researcher, or anything else.
LingPipe provides various NER methods:
1) Dictionary Based NER
2) Statistical NER (HMM Based)
3) Rule Based NER etc.
I have used the dictionary-based as well as the statistical approach.
The first is a direct lookup methodology, while the second one is training-based.
An example for the dictionary-based NER can be found here.
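Since that link isn't included above, here is a minimal sketch of what the dictionary-based approach looks like (my own example using LingPipe's ExactDictionaryChunker; the movie names and the MOVIE label are made up):

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.ExactDictionaryChunker;
import com.aliasi.dict.MapDictionary;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

public class DictionaryNerExample {

    public static void main(String[] args) {
        // build a dictionary of known multi-word entities
        MapDictionary<String> dictionary = new MapDictionary<String>();
        dictionary.addEntry(new DictionaryEntry<String>("Titanic", "MOVIE", 1.0));
        dictionary.addEntry(new DictionaryEntry<String>("Baby's day out", "MOVIE", 1.0));

        // returnAllMatches=true, caseSensitive=false
        ExactDictionaryChunker chunker = new ExactDictionaryChunker(
                dictionary, IndoEuropeanTokenizerFactory.INSTANCE, true, false);

        String text = "We watched Titanic and then Baby's day out.";
        Chunking chunking = chunker.chunk(text);
        for (Chunk c : chunking.chunkSet()) {
            System.out.println(text.substring(c.start(), c.end()) + " >> " + c.type());
        }
    }
}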
The statistical approach requires a training file. I used a file with the following format:

<root>
  <s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX> to be trained</s>
  ...
  <s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX> annotated </s>
</root>
I then used the following code to train the entities.
import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.corpus.parsers.Muc6ChunkParser;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

@SuppressWarnings("deprecation")
public class TrainEntities {

    static final int MAX_N_GRAM = 50;
    static final int NUM_CHARS = 300;
    static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("inputfile.txt"); // my annotated file
        File modelFile = new File("outputmodelfile.model");

        System.out.println("Setting up Chunker Estimator");
        TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator = new HmmCharLmEstimator(MAX_N_GRAM, NUM_CHARS, LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator = new CharLmHmmChunker(factory, hmmEstimator);

        System.out.println("Setting up Data Parser");
        Muc6ChunkParser parser = new Muc6ChunkParser();
        parser.setHandler(chunkerEstimator);

        System.out.println("Training with Data from File=" + corpusFile);
        parser.parse(corpusFile);

        System.out.println("Compiling and Writing Model to File=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator, modelFile);
    }
}
And to test the NER I used the following class
import java.io.File;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class Recognition {

    public static void main(String[] args) throws Exception {
        File modelFile = new File("outputmodelfile.model");
        Chunker chunker = (Chunker) AbstractExternalizable.readObject(modelFile);
        String testString = "my test string";
        Chunking chunking = chunker.chunk(testString);
        Set<Chunk> test = chunking.chunkSet();
        for (Chunk c : test) {
            System.out.println(testString + " : "
                    + testString.substring(c.start(), c.end()) + " >> "
                    + c.type());
        }
    }
}
Code Courtesy : Google :)
The answer is basically given in your quoted example, where "Emma Woodhouse" is a single name. The default models we supply use IO encoding, and assume that adjacent tokens of the same class are part of the same entity. In many circumstances, this is almost always true, and keeps the models simpler. However, if you don't want to do that you can train NER models with other label encodings, such as the commonly used IOB encoding, where you would instead label things:
Emma B-PERSON
Woodhouse I-PERSON
Then, adjacent tokens of the same category but not the same entity can be represented.
I faced the same challenge of tagging n-gram phrases for the automotive domain. I was looking for an efficient keyword mapping that could be used to create training files at a later stage. I ended up using RegexNER in the NLP pipeline, providing a mapping file with the regular expressions (n-gram component terms) and their corresponding labels. Note that there is no NER machine learning achieved in this case. Hope this information helps someone!
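For reference, a minimal sketch of that setup (the mapping file name, its entries, and the MOVIE label are my own examples). The mapping file, say movies.txt, holds one tab-separated pattern and label per line:

I know what you did last summer	MOVIE
Titanic	MOVIE

The pipeline then picks it up through the regexner.mapping property:

import java.util.*;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.*;

public class RegexNerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
        props.setProperty("regexner.mapping", "movies.txt");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = new Annotation("We watched I know what you did last summer.");
        pipeline.annotate(annotation);
        for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
            // tokens inside a mapped phrase get the MOVIE tag
            System.out.println(token.word() + "\t" + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
        }
    }
}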

Anaphora resolution using Stanford Coref

I have sentences (Text I):
Tom is a smart boy. He know a lot of thing.
I want to change He in the second sentence to Tom, so final sentences will become (Text II):
Tom is a smart boy. Tom know a lot of thing.
I've written some code, but my coref object is always null.
Besides, I have no idea what to do next to get the correct result.
String text = "Tom is a smart boy. He know a lot of thing.";
Annotation document = new Annotation(text);
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, parse, lemma, ner, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
List<Pair<IntTuple, IntTuple>> coref = document.get(CorefGraphAnnotation.class);
I want to know if I'm doing it wrong and what I should do next to get Text II from Text I.
PS: I'm using Stanford CoreNLP 1.3.0.
Thanks.
List<Pair<IntTuple, IntTuple>> coref = document.get(CorefGraphAnnotation.class);
This is an old coref output format.
You can change this line to
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
or you can use the oldCorefFormat option:
props.put("oldCorefFormat", "true");
