I have sentences (Text I):
Tom is a smart boy. He know a lot of thing.
I want to change "He" in the second sentence to "Tom", so the final sentences will become (Text II):
Tom is a smart boy. Tom know a lot of thing.
I've written some code, but my coref object is always null.
Besides, I have no idea what to do next to get the correct result.
String text = "Tom is a smart boy. He know a lot of thing.";
Annotation document = new Annotation(text);
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, parse, lemma, ner, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
List<Pair<IntTuple, IntTuple>> coref = document.get(CorefGraphAnnotation.class);
I want to know if I'm doing it wrong and what I should do next to get Text II from Text I.
PS: I'm using Stanford CoreNLP 1.3.0.
Thanks.
List<Pair<IntTuple, IntTuple>> coref = document.get(CorefGraphAnnotation.class);
This is an old coref output format.
You can change this line to
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
or you can use the oldCorefFormat option:
props.put("oldCorefFormat", "true");
How do I get the correct NER using spaCy from text like "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site."
Here "Criticized Trump" is recognized as a person instead of just "Trump".
How can I pre-process and lowercase words like "Criticized" or "Texts" in the above string to overcome this issue, or is there another technique to do so?
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
from pprint import pprint
sent = ("F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site")
doc = nlp(sent)
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])
Result from the above code:
"Criticized Trump" as 'PERSON' and "Texts" as 'GPE'
Expected result:
"Trump" as 'PERSON' instead of "Criticized Trump", and "Texts" with no entity label instead of 'GPE'
You can add more examples of named entities to tune the NER model. All the information needed to prepare training data is at https://spacy.io/usage/training, and you can use Prodigy (an annotation tool from the spaCy creators, https://prodi.gy) to mark named entities in your data.
Indeed, you can pre-process with POS tagging in order to lowercase words like "Criticized" or "Texts" that are not proper nouns.
Proper capitalization (lower vs. upper case) will help the NER tagger.
sent = "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site"
doc = nlp(sent)
words = []
spaces = []
for a in doc:
if a.pos_ != 'PROPN':
words.append( a.text.lower() )
else:
words.append(a.text)
spaces.append(a.whitespace_)
spaces = [len(sp) for sp in spaces]
docNew = Doc(nlp.vocab, words=words, spaces=spaces)
print(docNew)
# F.B.I. Agent Peter Strzok, who criticized Trump in texts, is fired - the New York Times SectionsSEARCHSkip to contentskip to site
In DependencyParser.java in the CoreNLP repository, I can see it's using recursive neural networks.
And from the open lecture (http://cs224d.stanford.edu), I learned that these networks calculate phrase vectors at each node of the parse tree.
I'm trying to make the parser output phrase vectors so that I can plot them on a 2-D plane, but so far I haven't figured it out. Can someone please point me to the Java object and line numbers where they are calculated? (I suspect they would be around line 765.)
private void setupClassifierForTraining(List<CoreMap> trainSents, List<DependencyTree> trainTrees, String embedFile, String preModel) {
  // E: one embedding row per known word, POS tag, and dependency label
  double[][] E = new double[knownWords.size() + knownPos.size() + knownLabels.size()][config.embeddingSize];
  // W1, b1: hidden layer over the concatenated feature embeddings
  double[][] W1 = new double[config.hiddenSize][config.embeddingSize * config.numTokens];
  double[] b1 = new double[config.hiddenSize];
  // W2: output layer, one row per parser transition
  double[][] W2 = new double[system.numTransitions()][config.hiddenSize];
And if this is not the correct place to be looking for phrase vectors, I'd really appreciate a pointer to the code in the CoreNLP project I should be looking at.
Which lecture are you referring to?
This paper describes the neural network dependency parser we distribute:
http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf
I don't believe it creates phrase embeddings; it creates embeddings for words, part-of-speech tags, and dependency labels.
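To make this concrete: the classifier in that paper scores transitions with a single hidden layer over the concatenated feature embeddings. The following is only a schematic of that computation under the paper's definitions, not CoreNLP's actual code (featureIds is a hypothetical array of row indices into E, one per word/POS/label feature):
// Schematic of the Chen & Manning (2014) scoring function.
static double[] score(int[] featureIds, double[][] E, double[][] W1, double[] b1, double[][] W2, int embeddingSize) {
    // look up and concatenate one embedding per feature
    double[] input = new double[featureIds.length * embeddingSize];
    for (int i = 0; i < featureIds.length; i++)
        System.arraycopy(E[featureIds[i]], 0, input, i * embeddingSize, embeddingSize);
    // hidden layer with the paper's cube activation
    double[] hidden = new double[b1.length];
    for (int h = 0; h < hidden.length; h++) {
        double s = b1[h];
        for (int j = 0; j < input.length; j++) s += W1[h][j] * input[j];
        hidden[h] = s * s * s;
    }
    // one output score per parser transition
    double[] scores = new double[W2.length];
    for (int t = 0; t < W2.length; t++)
        for (int h = 0; h < hidden.length; h++)
            scores[t] += W2[t][h] * hidden[h];
    return scores;
}
Since nothing is composed over tree nodes, no phrase vectors fall out of this; the only learned vectors are the rows of E.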
I'm trying to extract information from natural language content using the Stanford CoreNLP library.
My goal is to extract "subject-action-object" pairs (simplified) from sentences.
As an example consider the following sentence:
John Smith only eats an apple and a banana for lunch. He's on a diet and his mother told him that it would be very healthy to eat less for lunch. John doesn't like it at all but since he's very serious with his diet, he doesn't want to stop.
From this sentence I would like to get results as follows:
John Smith - eats - only an apple and a banana for lunch
He - is - on a diet
His mother - told - him - that it would be very healthy to eat less for lunch
John - doesn't like - it (at all)
He - is - very serious with his diet
How would one do this?
Or to be more specific:
How can I parse a dependency tree (or a better-suited tree?) to obtain results as specified above?
Any hint, resource or code snippet given this task would be highly appreciated.
Side note:
I managed to replace coreferences with their representative mention, which changes "he" and "his" to the corresponding entity (John Smith in this case).
The Stanford CoreNLP toolkit comes with a dependency parser.
First of all, here is a link where the types of edges in the tree are described:
http://universaldependencies.github.io/docs/
There are numerous ways you can use the toolkit to generate the dependency tree.
Here is some sample code to get you started:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
public class DependencyTreeExample {

  public static void main(String[] args) throws IOException {
    // set up properties
    Properties props = new Properties();
    props.setProperty("ssplit.eolonly", "true");
    props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
    // set up pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // get contents from file
    String content = new Scanner(new File(args[0])).useDelimiter("\\Z").next();
    System.out.println(content);
    // annotate the file contents (one sentence per line)
    Annotation annotation = new Annotation(content);
    pipeline.annotate(annotation);
    List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
      System.out.println("---");
      System.out.println("sentence: " + sentence);
      SemanticGraph tree = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
      System.out.println(tree.toString(SemanticGraph.OutputFormat.READABLE));
    }
  }
}
Instructions:
Cut and paste this into DependencyTreeExample.java.
Put that file in the directory stanford-corenlp-full-2015-04-20.
Compile it: javac -cp "*:." DependencyTreeExample.java
Add your sentences, one per line, to a file called dependency_sentences.txt.
Run it: java -cp "*:." DependencyTreeExample dependency_sentences.txt
An example of the output:
sentence: John doesn't like it at all.
dep reln gov
--- ---- ---
like-4 root root
John-1 nsubj like-4
does-2 aux like-4
n't-3 neg like-4
it-5 dobj like-4
at-6 case all-7
all-7 nmod:at like-4
.-8 punct like-4
This will print out the dependency parses. By working with the SemanticGraph object you can write code to find the kinds of patterns you want.
You'll note that in this example "like" points to "John" with "nsubj" and to "it" with "dobj".
For reference you should look at edu.stanford.nlp.semgraph.SemanticGraph
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/SemanticGraph.html
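As a starting point for that pattern matching, here is a rough sketch that goes inside the sentence loop of the example above (deliberately simplified: it only pairs nsubj with dobj, so it will miss copular constructions like "He - is - on a diet"):
// find simple subject - verb - object triples in the dependency graph
for (IndexedWord verb : tree.vertexSet()) {
    IndexedWord subj = null;
    IndexedWord obj = null;
    for (SemanticGraphEdge edge : tree.outgoingEdgeIterable(verb)) {
        String reln = edge.getRelation().toString();
        if (reln.equals("nsubj")) subj = edge.getDependent();
        if (reln.equals("dobj")) obj = edge.getDependent();
    }
    if (subj != null && obj != null) {
        System.out.println(subj.word() + " - " + verb.word() + " - " + obj.word());
    }
}
For "John doesn't like it at all." this prints "John - like - it". IndexedWord and SemanticGraphEdge are covered by the edu.stanford.nlp.ling and edu.stanford.nlp.semgraph imports already in the example.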
You could also try out the new Stanford OpenIE system: http://nlp.stanford.edu/software/openie.shtml. In addition to the standalone download, it's now bundled in CoreNLP 3.6.0+.
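A minimal OpenIE sketch (annotator list as in the OpenIE documentation; RelationTriple lives in edu.stanford.nlp.ie.util and the annotation key in edu.stanford.nlp.naturalli):
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, depparse, natlog, openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation doc = new Annotation("John doesn't like it at all.");
pipeline.annotate(doc);
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    // each triple is a (subject, relation, object) extraction
    for (RelationTriple triple : sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class)) {
        System.out.println(triple.subjectGloss() + " - " + triple.relationGloss() + " - " + triple.objectGloss());
    }
}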
I am using Stanford CoreNLP to try to find grammatical relations of noun phrases.
Here is an example:
Given the sentence "The fitness room was dirty."
I managed to identify "The fitness room" as my target noun phrase. I am now looking for a way to find that the adjective "dirty" relates to "the fitness room" and not only to "room".
example code:
private static void doSentenceTest() {
  Properties props = new Properties();
  props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
  StanfordCoreNLP stanford = new StanfordCoreNLP(props);
  TregexPattern npPattern = TregexPattern.compile("#NP");
  String text = "The fitness room was dirty.";
  // create an empty Annotation just with the given text
  Annotation document = new Annotation(text);
  // run all Annotators on this text
  stanford.annotate(document);
  List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
  for (CoreMap sentence : sentences) {
    Tree sentenceTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
    TregexMatcher matcher = npPattern.matcher(sentenceTree);
    while (matcher.find()) {
      // this tree should contain "The fitness room"
      Tree nounPhraseTree = matcher.getMatch();
      // Question: how do I find that "dirty" has a relationship to the nounPhraseTree?
    }
    // output the dependency tree
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(sentenceTree);
    Collection<TypedDependency> tdl = gs.typedDependenciesCollapsed();
    System.out.println("typedDependencies: " + tdl);
  }
}
I used Stanford CoreNLP on the sentence and extracted its root Tree object. On this tree object I managed to extract noun phrases using a TregexPattern and a TregexMatcher. This gives me a child Tree that contains the actual noun phrase. What I would like to do now is find modifiers of the noun phrase in the original sentence.
The typedDependencies output gives me the following:
typedDependencies: [det(room-3, The-1), nn(room-3, fitness-2), nsubj(dirty-5, room-3), cop(dirty-5, was-4), root(ROOT-0, dirty-5)]
where I can see nsubj(dirty-5, room-3), but the relation only involves the head word "room", not the full noun phrase.
I hope I am clear enough.
Any help appreciated.
The typed dependencies do show that the adjective 'dirty' applies to 'the fitness room':
det(room-3, The-1)
nn(room-3, fitness-2)
nsubj(dirty-5, room-3)
cop(dirty-5, was-4)
root(ROOT-0, dirty-5)
The 'nn' tag is the noun compound modifier, indicating that 'fitness' is a modifier of 'room'.
You can find detailed information on the dependency tags in the Stanford dependency manual.
Modify the line
Collection<TypedDependency> tdl = gs.typedDependenciesCollapsed();
to
Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
or
Collection<TypedDependency> tdl = gs.allDependencies();
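If you then want to recover the full noun phrase from the head word of a relation, one option is to collect the head's determiner, noun-compound and adjectival dependents and reassemble them in word order. A hedged sketch (expandNounPhrase is a hypothetical helper; gov() and dep() return IndexedWord in recent CoreNLP versions, and Map/TreeMap come from java.util):
// given the head of a relation such as nsubj(dirty-5, room-3), rebuild its phrase
static String expandNounPhrase(IndexedWord head, Collection<TypedDependency> tdl) {
    Map<Integer, String> words = new TreeMap<>();  // keeps tokens in sentence order
    words.put(head.index(), head.word());
    for (TypedDependency td : tdl) {
        String reln = td.reln().getShortName();
        if (td.gov().index() == head.index()
                && (reln.equals("det") || reln.equals("nn") || reln.equals("amod"))) {
            words.put(td.dep().index(), td.dep().word());
        }
    }
    return String.join(" ", words.values());  // e.g. "The fitness room"
}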
We are trying to use our existing
tokenization
sentence splitting
and named entity tagging
while we would like to use Stanford CoreNLP to additionally provide us with
part-of-speech tagging
lemmatization
and parsing
Currently, we are trying it the following way:
1) make an annotator for "pos, lemma, parse"
Properties pipelineProps = new Properties();
pipelineProps.put("annotators", "pos, lemma, parse");
pipelineProps.setProperty("parse.maxlen", "80");
pipelineProps.setProperty("pos.maxlen", "80");
StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps);
2) read in the sentences, with a custom method:
List<CoreMap> sentences = getSentencesForTaggedFile(idToDoc.get(docId));
within that method, the tokens are constructed the following way:
CoreLabel clToken = new CoreLabel();
clToken.setValue(stringToken);
clToken.setWord(stringToken);
clToken.setOriginalText(stringToken);
clToken.set(CoreAnnotations.NamedEntityTagAnnotation.class, neTag);
sentenceTokens.add(clToken);
and they are combined into sentences like this:
Annotation sentence = new Annotation(sb.toString());
sentence.set(CoreAnnotations.TokensAnnotation.class, sentenceTokens);
sentence.set(CoreAnnotations.TokenBeginAnnotation.class, tokenOffset);
tokenOffset += sentenceTokens.size();
sentence.set(CoreAnnotations.TokenEndAnnotation.class, tokenOffset);
sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, sentences.size());
3) the list of sentences is passed to the pipeline:
Annotation document = new Annotation(sentences);
pipeline.annotate(document);
However, when running this, we get the following error:
null: InvocationTargetException: annotator "pos" requires annotator "tokenize"
Any pointers on how we can achieve what we want to do?
The exception is thrown because of an unsatisfied requirement expected by the "pos" annotator (an instance of the POSTaggerAnnotator class).
Requirements for the annotators which StanfordCoreNLP knows how to create are defined in the Annotator interface. For the "pos" annotator there are two requirements defined:
tokenize
ssplit
Both of these requirements need to be satisfied, which means that both the "tokenize" annotator and the "ssplit" annotator must be specified in the annotators list before "pos".
Now back to the question... If you would like to skip the "tokenize" and "ssplit" annotations in your pipeline, you need to disable the requirements check which is performed during initialization of the pipeline. I found two equivalent ways this can be done:
Disable requirements enforcement in properties object passed to StanfordCoreNLP constructor:
props.setProperty("enforceRequirements", "false");
Set enforceRequirements parameter of StanfordCoreNLP constructor to false
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
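Putting it together with the setup from step 1, a minimal sketch using the constructor flag (same properties as yours):
Properties pipelineProps = new Properties();
pipelineProps.put("annotators", "pos, lemma, parse");
pipelineProps.setProperty("parse.maxlen", "80");
pipelineProps.setProperty("pos.maxlen", "80");
// false disables the requirements check, so "pos" no longer insists on "tokenize"/"ssplit"
StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps, false);
pipeline.annotate(document);  // the Annotation built from your pre-tokenized sentences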
You should add the "tokenize" annotator to the parameters:
pipelineProps.put("annotators", "tokenize, pos, lemma, parse");