Stanford CoreNLP: Use partial existing annotation

We are trying to use existing tokenization, sentence splitting, and named entity tagging, while we would like Stanford CoreNLP to additionally provide us with part-of-speech tagging, lemmatization, and parsing.
Currently, we are trying it the following way:
1) make a pipeline with the annotators "pos, lemma, parse"
Properties pipelineProps = new Properties();
pipelineProps.put("annotators", "pos, lemma, parse");
pipelineProps.setProperty("parse.maxlen", "80");
pipelineProps.setProperty("pos.maxlen", "80");
StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps);
2) read in the sentences, with a custom method:
List<CoreMap> sentences = getSentencesForTaggedFile(idToDoc.get(docId));
within that method, the tokens are constructed the following way:
CoreLabel clToken = new CoreLabel();
clToken.setValue(stringToken);
clToken.setWord(stringToken);
clToken.setOriginalText(stringToken);
clToken.set(CoreAnnotations.NamedEntityTagAnnotation.class, neTag);
sentenceTokens.add(clToken);
and they are combined into sentences like this:
Annotation sentence = new Annotation(sb.toString());
sentence.set(CoreAnnotations.TokensAnnotation.class, sentenceTokens);
sentence.set(CoreAnnotations.TokenBeginAnnotation.class, tokenOffset);
tokenOffset += sentenceTokens.size();
sentence.set(CoreAnnotations.TokenEndAnnotation.class, tokenOffset);
sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, sentences.size());
3) the list of sentences is passed to the pipeline:
Annotation document = new Annotation(sentences);
pipeline.annotate(document);
However, when running this, we get the following error:
null: InvocationTargetException: annotator "pos" requires annotator "tokenize"
Any pointers on how we can achieve what we want to do?

The exception is thrown because of an unsatisfied requirement declared by the "pos" annotator (an instance of the POSTaggerAnnotator class).
The requirements for the annotators which StanfordCoreNLP knows how to create are defined in the Annotator interface. For the "pos" annotator there are two requirements:
tokenize
ssplit
Both of these requirements need to be satisfied, which means that both the "tokenize" annotator and the "ssplit" annotator must be specified in the annotators list before the "pos" annotator.
Now back to the question... If you want to skip the "tokenize" and "ssplit" annotations in your pipeline, you need to disable the requirements check which is performed during initialization of the pipeline. I found two equivalent ways to do this:
Disable requirements enforcement in the Properties object passed to the StanfordCoreNLP constructor:
props.setProperty("enforceRequirements", "false");
Set the enforceRequirements parameter of the StanfordCoreNLP constructor to false:
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
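Putting the two pieces together, a minimal sketch of the asker's setup with the requirements check disabled (reusing getSentencesForTaggedFile, idToDoc and docId from the question) might look like this:
Properties pipelineProps = new Properties();
pipelineProps.put("annotators", "pos, lemma, parse");
pipelineProps.setProperty("parse.maxlen", "80");
pipelineProps.setProperty("pos.maxlen", "80");

// The second constructor argument disables the requirements check, so "pos"
// no longer complains about the missing "tokenize" and "ssplit" annotators.
StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps, false);

// Sentences built from the existing tokenization and NE tags, as in step 2 above.
List<CoreMap> sentences = getSentencesForTaggedFile(idToDoc.get(docId));
Annotation document = new Annotation(sentences);
pipeline.annotate(document);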

You should add the "tokenize" annotator to the parameters:
pipelineProps.put("annotators", "tokenize, pos, lemma, parse");

Related

How to add stop words to TfidfVectorizer?

I am trying to add stop words to my stop_words list; however, the code I am using doesn't seem to be working:
Creating stop words list:
stopwords = nltk.corpus.stopwords.words('english')
CustomListofWordstoExclude = ['rt']
stopwords1 = stopwords.extend(CustomListofWordstoExclude)
Here I am converting the text to a dtm (document term matrix) with tfidf weighting:
vect = TfidfVectorizer(stop_words = 'english', min_df=150, token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])
dtm.shape
But when I do this, I get this error:
FutureWarning: Pass input=None as keyword args. From version 0.25 passing these as positional arguments will result in an error
warnings.warn("Pass {} as keyword args. From version 0.25 "
What does this mean? Is there an easier way to add stopwords?
I'm unable to reproduce the warning. However, note that a warning such as this does not mean that your code did not run as intended. It means that in future releases of the package it may not work as intended. So if you try the same thing next year with updated packages, it may not work.
With respect to your question about using stop words, there are two changes that need to be made for your code to work as you expect.
list.extend() extends the list in-place, but it doesn't return the list. To see this you can do type(stopwords1) which gives NoneType. To define a new variable and add the custom words list to stopwords in one line, you could just use the built-in + operator functionality for lists:
import nltk  # assumes the NLTK stopwords corpus is available (nltk.download('stopwords'))

stopwords = nltk.corpus.stopwords.words('english')
CustomListofWordstoExclude = ['rt']
stopwords1 = stopwords + CustomListofWordstoExclude
To actually use stopwords1 as your new stopwords list when performing the TF-IDF vectorization, you need to pass stop_words=stopwords1:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(stop_words=stopwords1,  # passed stopwords1 here
                       min_df=150,
                       token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])
dtm.shape
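As a quick sanity check (using a tiny hypothetical df standing in for your tweets, so min_df is left at its default here), the custom word 'rt' should end up in the vectorizer's effective stop word set and stay out of the learned vocabulary:
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords1 = nltk.corpus.stopwords.words('english') + ['rt']

# Hypothetical toy data standing in for df['tweets']
df = pd.DataFrame({'tweets': ['rt this is a tweet', 'another tweet here']})

vect = TfidfVectorizer(stop_words=stopwords1, token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])

print('rt' in vect.get_stop_words())  # True  - 'rt' is treated as a stop word
print('rt' in vect.vocabulary_)       # False - and is excluded from the vocabulary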

Customize spacy stop words and save the model

I am using this to add stop words to spaCy's list of stop words:
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}
However, when I save the nlp object using nlp.to_disk() and load it back again with nlp.from_disk(),
I am losing the list of custom stop words.
Is there a way to save the custom stopwords with the nlp model?
Thanks in advance
Most language defaults (stop words, lexical attributes, and syntax iterators) are not saved with the model.
If you want to customize them, you can create a custom language class, see: https://spacy.io/usage/linguistic-features#language-subclass. An example copied from this link:
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

nlp1 = English()
nlp2 = CustomEnglish()

print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
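If subclassing feels like too much for your use case, a simpler workaround is to keep the custom words in your own code and re-apply them after loading the saved model. A minimal sketch, where the model path and variable names are just placeholders:
import spacy

CUSTOM_STOP_WORDS = {"my_new_stopword1", "my_new_stopword2"}

nlp = spacy.load("/path/to/your_saved_model")  # hypothetical path
for word in CUSTOM_STOP_WORDS:
    nlp.Defaults.stop_words.add(word)  # affects lexemes created from here on
    nlp.vocab[word].is_stop = True     # also flag the lexeme directly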

Is there a way to turn off specific built-in tokenization rules in Spacy?

Spacy automatically tokenizes word contractions such as "dont" and "don't" into "do" and "nt"/"n't". For instance, a sentence like "I dont understand" would be tokenized into: ["I", "do", "nt", "understand"].
I understand this is usually helpful in many NLP tasks, but is there a way to suppress this special tokenization rule in Spacy such that the result is ["I", "dont", "understand"] instead?
This is because I am trying to evaluate the performance (f1-score for the BIO tagging scheme) of my custom Spacy NER model, and the mismatch between the number of tokens in the input sentence and the number of predicted token tags is causing problems for my evaluation code down the line:
Input (3 tokens): [("I", "O"), ("dont", "O"), ("understand", "O")]
Predicted (4 tokens): [("I", "O"), ("do", "O"), ("nt", "O"), ("understand", "O")]
Of course, if anyone has any suggestions for a better way to perform evaluation on sequential tagging tasks in Spacy (perhaps like the seqeval package but more compatible with Spacy's token format), that would be greatly appreciated as well.
The special-case tokenization rules are defined in tokenizer_exceptions.py in the respective language data (see here for the English "nt" contractions). When you create a new Tokenizer, those special-case rules can be passed in via the rules argument.
Approach 1: Custom tokenizer with different special case rules
So one thing you could do for your use case is to reconstruct the English Tokenizer with the same prefix, suffix and infix rules, but with only a filtered set of tokenizer exceptions. Tokenizer exceptions are keyed by the string, so you could remove the entries for "dont" and whatever else you need. However, the code is quite verbose, since you're reconstructing the whole tokenizer:
from spacy.lang.en import English
from spacy.lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from spacy.lang.en import TOKENIZER_EXCEPTIONS
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

prefix_re = compile_prefix_regex(TOKENIZER_PREFIXES).search
suffix_re = compile_suffix_regex(TOKENIZER_SUFFIXES).search
infix_re = compile_infix_regex(TOKENIZER_INFIXES).finditer

filtered_exc = {key: value for key, value in TOKENIZER_EXCEPTIONS.items() if key not in ["dont"]}

nlp = English()
tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re,
    suffix_search=suffix_re,
    infix_finditer=infix_re,
    rules=filtered_exc
)
nlp.tokenizer = tokenizer
doc = nlp("I dont understand")
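For a quick check, the contraction should now survive as a single token:
print([t.text for t in doc])  # ['I', 'dont', 'understand']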
Approach 2: Merge (or split) tokens afterwards
An alternative approach would be to keep the tokenization as it is, but add rules on top that merge certain tokens back together afterwards to match the desired tokenization. This is obviously going to be slower at runtime, but it might be easier to implement and reason about, because you can approach it from the perspective of "Which tokens are currently separated but should be one?". For this, you could use the rule-based Matcher and the retokenizer to merge the matched tokens back together. As of spaCy v2.1, it also supports splitting, in case that's relevant.
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
patterns = [[{"LOWER": "do"}, {"LOWER": "nt"}]]
matcher.add("TO_MERGE", None, *patterns)

doc = nlp("I dont understand")
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        span = doc[start:end]
        retokenizer.merge(span)
The above pattern would match two tokens (one dict per token), whose lowercase forms are "do" and "nt" (e.g. "DONT", "dont", "DoNt"). You can add more lists of dicts to the patterns to describe other sequences of tokens. For each match, you can then create a Span and merge it into one token. To make this logic more elegant, you could also wrap it as a custom pipeline component, so it's applied automatically when you call nlp on a text.
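A rough sketch of such a component (using the spaCy v2-style API to match the code above; merge_contractions is a hypothetical name, not part of spaCy):
from spacy.lang.en import English
from spacy.matcher import Matcher

def make_merge_contractions(nlp):
    matcher = Matcher(nlp.vocab)
    matcher.add("TO_MERGE", None, [{"LOWER": "do"}, {"LOWER": "nt"}])

    def merge_contractions(doc):
        # Merge every matched "do" + "nt" pair back into a single token.
        with doc.retokenize() as retokenizer:
            for match_id, start, end in matcher(doc):
                retokenizer.merge(doc[start:end])
        return doc

    return merge_contractions

nlp = English()
nlp.add_pipe(make_merge_contractions(nlp), name="merge_contractions", first=True)
print([t.text for t in nlp("I dont understand")])  # ['I', 'dont', 'understand']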

How can I find grammatical relations of a noun phrase using Stanford Parser or Stanford CoreNLP

I am using Stanford CoreNLP to try to find the grammatical relations of noun phrases.
Here is an example:
Given the sentence "The fitness room was dirty."
I managed to identify "The fitness room" as my target noun phrase. I am now looking for a way to find that the "dirty" adjective has a relationship to "the fitness room" and not only to "room".
example code:
private static void doSentenceTest() {
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    StanfordCoreNLP stanford = new StanfordCoreNLP(props);
    TregexPattern npPattern = TregexPattern.compile("#NP");

    String text = "The fitness room was dirty.";
    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);
    // run all Annotators on this text
    stanford.annotate(document);

    List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        Tree sentenceTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
        TregexMatcher matcher = npPattern.matcher(sentenceTree);
        while (matcher.find()) {
            // this tree should contain "The fitness room"
            Tree nounPhraseTree = matcher.getMatch();
            // Question: how do I find that "dirty" has a relationship to the nounPhraseTree?
        }
        // Output dependency tree
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(sentenceTree);
        Collection<TypedDependency> tdl = gs.typedDependenciesCollapsed();
        System.out.println("typedDependencies: " + tdl);
    }
}
I used Stanford CoreNLP on the sentence and extracted its root Tree object. On this tree object I managed to extract noun phrases using a TregexPattern and a TregexMatcher. This gives me a child Tree that contains the actual noun phrase. What I would like to do now is find modifiers of the noun phrase in the original sentence.
The typedDependencies output gives me the following:
typedDependencies: [det(room-3, The-1), nn(room-3, fitness-2), nsubj(dirty-5, room-3), cop(dirty-5, was-4), root(ROOT-0, dirty-5)]
where I can see nsubj(dirty-5, room-3), but I don't have the full noun phrase as the dominator.
I hope I am clear enough.
Any help appreciated.
The typed dependencies do show that the adjective 'dirty' applies to 'the fitness room':
det(room-3, The-1)
nn(room-3, fitness-2)
nsubj(dirty-5, room-3)
cop(dirty-5, was-4)
root(ROOT-0, dirty-5)
The 'nn' tag is the noun compound modifier, indicating that 'fitness' is a modifier of 'room'.
You can find detailed information on the dependency tags in the Stanford dependency manual.
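To tie this back to the Tregex match in the question's code, one possible approach (a sketch only, assuming a CoreNLP version where gov() and dep() return IndexedWord, and that the tdl collection from the question is built before the Tregex loop) is to take the token span covered by the matched NP subtree and look for typed dependencies whose dependent lies inside that span while the governor lies outside it. For the example sentence this surfaces nsubj(dirty-5, room-3), i.e. "dirty" relating to the whole phrase "The fitness room":
// Inside the while (matcher.find()) loop, once nounPhraseTree is known:
List<Tree> allLeaves = sentenceTree.getLeaves();
List<Tree> npLeaves = nounPhraseTree.getLeaves();

// Dependency indices are 1-based token positions within the sentence.
int npStart = allLeaves.indexOf(npLeaves.get(0)) + 1;
int npEnd = npStart + npLeaves.size() - 1;

for (TypedDependency td : tdl) {
    int govIdx = td.gov().index();
    int depIdx = td.dep().index();
    // Dependent inside the NP, governor outside it: the governor word
    // ("dirty" via nsubj here) relates to the noun phrase as a whole.
    if (depIdx >= npStart && depIdx <= npEnd && (govIdx < npStart || govIdx > npEnd)) {
        System.out.println(td.reln() + ": " + td.gov().word()
                + " relates to the NP " + nounPhraseTree.yieldWords());
    }
}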
Modify the line
Collection<TypedDependency> tdl = gs.typedDependenciesCollapsed();
to
Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
or
Collection<TypedDependency> tdl = gs.allDependencies();

Anaphora resolution using Stanford Coref

I have sentences (Text I):
Tom is a smart boy. He know a lot of thing.
I want to change He in the second sentence to Tom, so final sentences will become (Text II):
Tom is a smart boy. Tom know a lot of thing.
I've written some code, but my coref object is always null.
Besides, I have no idea what to do next to get the correct result.
String text = "Tom is a smart boy. He know a lot of thing.";
Annotation document = new Annotation(text);
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, parse, lemma, ner, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
List<Pair<IntTuple, IntTuple>> coref = document.get(CorefGraphAnnotation.class);
I want to know if I'm doing it wrong and what I should do next to get Text II from Text I.
PS: I'm using Stanford CoreNLP 1.3.0.
Thanks.
List<Pair<IntTuple, IntTuple>> coref = document.get(CorefGraphAnnotation.class);
This is an old coref output format.
You can change this line to
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
or you can use the oldCorefFormat option:
props.put("oldCorefFormat", "true");
