I am trying to find the Theme/Noun phrase in a sentence using Stanford NLP.
For example, for the sentence "the white tiger" I would like to get the
Theme/Noun phrase "white tiger".
For this I used the POS tagger. The result I am getting is "tiger", which is not correct. The sample code I ran is:
public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation annotation = new Annotation("the white tiger");
    pipeline.annotate(annotation);

    List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
    System.out.println("the number of sentences is... " + sentences.size());

    for (CoreMap sentence : sentences) {
        System.out.println("the sentence is... " + sentence.toString());

        Tree tree = sentence.get(TreeAnnotation.class);
        PrintWriter out = new PrintWriter(System.out);
        out.println("The first sentence parsed is:");
        tree.pennPrint(out);

        TregexPattern pattern = TregexPattern.compile("#NP");
        TregexMatcher matcher = pattern.matcher(tree);
        while (matcher.find()) {
            Tree match = matcher.getMatch();
            List<Tree> children = match.getChildrenAsList();
            StringBuilder stringBuilder = new StringBuilder();
            for (Tree child : children) {
                String val = child.label().value();
                if (val.equals("NN") || val.equals("NNS")
                        || val.equals("NNP") || val.equals("NNPS")) {
                    Tree[] nn = child.children();
                    String ss = Sentence.listToString(nn[0].yield());
                    stringBuilder.append(ss).append(" ");
                }
            }
            System.out.println("the final StringBuilder is... " + stringBuilder);
        }
    }
}
Any help is really appreciated, as are any other thoughts on how to achieve this.
It looks like you're descending the parse trees looking for NN.*.
"white" is a JJ (an adjective), which won't be included when you search only for NN.*.
You should take a close look at the Stanford Dependencies Manual and decide which part-of-speech tags encompass what you're looking for. You should also look at real linguistic data to try to figure out what matters in the task you're trying to complete. What about:
the tiger [with the black one] [who was white]
Simply traversing the tree in that case will give you "tiger black white". Exclude PPs? Then you lose lots of good information:
the tiger [with white fur]
I'm not sure what you're trying to accomplish, but make sure what you're trying to do is restricted in the right way.
You ought to polish up on your basic syntax as well. "the white tiger" is what linguists call a Noun Phrase, or NP. You'd be hard-pressed to find a linguist who would call an NP a sentence. There are also often many NPs inside a sentence; sometimes they're even embedded inside one another. The Stanford Dependencies Manual is a good start. As the name suggests, the Stanford Dependencies are based on the idea of dependency grammar, though there are other approaches that bring different insights to the table.
Learning what linguists know about the structure of sentences could help you significantly in getting at what you're trying to extract or--as happens often--realizing that what you're trying to extract is too difficult and that you need to find a new route to a solution.
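To make the tag-filter point concrete, here is a toy sketch in plain Java (no CoreNLP dependency; the tagged-word representation and helper are made up for illustration) that keeps adjectives as well as nouns, so a modifier like "white" survives:

```java
import java.util.Arrays;
import java.util.List;

public class NounPhraseFilter {
    // Keep a word if its POS tag is a noun (NN, NNS, NNP, NNPS) or an
    // adjective (JJ, JJR, JJS); drop determiners such as "the".
    static String themePhrase(List<String[]> taggedWords) {
        StringBuilder sb = new StringBuilder();
        for (String[] tagAndWord : taggedWords) {
            String tag = tagAndWord[0];
            if (tag.startsWith("NN") || tag.startsWith("JJ")) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(tagAndWord[1]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // (tag, word) pairs as a parser might tag "the white tiger"
        List<String[]> np = Arrays.asList(
                new String[] {"DT", "the"},
                new String[] {"JJ", "white"},
                new String[] {"NN", "tiger"});
        System.out.println(themePhrase(np)); // white tiger
    }
}
```

Whether JJ alone is the right extension depends on your data; as the answer notes, you have to decide which tags (and which tree configurations) actually matter for your task.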
Related
I'm trying to parse some text to find all references to a particular item. So, for example, if my item was The Bridge on the River Kwai and I passed it this text, I'd like it to find all the instances I've put in bold.
The Bridge on the River Kwai is a 1957 British-American epic war film
directed by David Lean and starring William Holden, Jack Hawkins, Alec
Guinness, and Sessue Hayakawa. The film is a work of fiction, but
borrows the construction of the Burma Railway in 1942–1943 for its
historical setting. The movie was filmed in Ceylon (now Sri Lanka).
The bridge in the film was near Kitulgala.
So far my attempt has been to go through all the mentions attached to each CorefChain and loop through those hunting for my target string. If I find the target string, I add the whole CorefChain, as I think this means the other items in that CorefChain also refer to the same thing.
List<CorefChain> gotRefs = new ArrayList<CorefChain>();
String pQuery = "The Bridge on the River Kwai";

for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
    List<CorefChain.CorefMention> corefMentions = cc.getMentionsInTextualOrder();
    boolean addedChain = false;
    for (CorefChain.CorefMention cm : corefMentions) {
        if (!addedChain && pQuery.equals(cm.mentionSpan)) {
            gotRefs.add(cc);
            addedChain = true;
        }
    }
}
I then loop through this second list of CorefChains, re-retrieve the mentions for each chain and step through them. In that loop I show which sentences have a likely mention of my item in a sentence.
for (CorefChain gr : gotRefs) {
    List<CorefChain.CorefMention> corefMentionsUsing = gr.getMentionsInTextualOrder();
    for (CorefChain.CorefMention cm : corefMentionsUsing) {
        System.out.println("Got reference to " + cm.mentionSpan + " in sentence #" + cm.sentNum);
    }
}
It finds some of my references, but not that many, and it produces a lot of false positives. As is probably entirely apparent from reading this, I don't really know the first thing about NLP - am I going about this entirely the wrong way? Is there a StanfordNLP parser that will already do some of what I'm after? Should I be training a model in some way?
I think a problem with your example is that you are looking for references to a movie title, and there isn't support in Stanford CoreNLP for recognizing movie titles, book titles, etc.
If you look at this example:
"Joe bought a laptop. He is happy with it."
You will notice that it connects:
"Joe" -> "He"
and
"a laptop" -> "it"
Coreference is an active research area and even the best system can only really be expected to produce an F1 of around 60.0 on general text, meaning it will often make errors.
I am doing some queries over a Lucene index. Right now I'm looking for Latin phrases in these queries. The problem is that some of these phrases include words that I think are considered stop words. For example, if my search term is "a contrario sensu" the result is zero, but if I only search for "contrario sensu" I get over 100 matches.
How can I make the search work despite these stop words?
My code looks like this:
public IEnumerable<TesisIndx> Search(string searchTerm)
{
    List<TesisIndx> results = new List<TesisIndx>();

    IndexSearcher searcher = new IndexSearcher(FSDirectory.GetDirectory(indexPath));
    QueryParser parser = new QueryParser("Rubro", analyzer);

    PhraseQuery q = new PhraseQuery();
    String[] words = searchTerm.Split(' ');
    foreach (string word in words)
    {
        q.Add(new Term("Rubro", word));
    }

    //Query query = parser.Parse(searchTerm);
    Hits hitsFound = searcher.Search(q);

    TesisIndx sampleDataFileRow = null;
    for (int i = 0; i < hitsFound.Length(); i++)
    {
        sampleDataFileRow = new TesisIndx();
        Document doc = hitsFound.Doc(i);
        sampleDataFileRow.Ius = int.Parse(doc.Get("Ius"));
        sampleDataFileRow.Rubro = doc.Get("Rubro");
        sampleDataFileRow.Texto = doc.Get("Texto");
        results.Add(sampleDataFileRow);
    }

    return results; // was missing: the method declares a return type
}
I use a StandardAnalyzer to build the index and perform the search.
Yes, "a" is a stop word. However, when it comes to phrase queries, that doesn't mean it isn't considered at all. If you try printing your query after parsing, you should see something like:
Rubro:"? contrario sensu"
That question mark represents a position increment, in this case a removed stop word. So it's looking for the phrase with a gap where a stop word has been removed at the beginning.
You can disable position increments in the query parser with QueryParser.setEnablePositionIncrements(false), though you should be aware this could cause problems for you if you still have position increments in the index, and run into a stop word in the middle of a phrase.
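As a toy illustration (plain Java, no Lucene dependency; the stop list and the "?" placeholder convention are simplified stand-ins for what the analyzer does), here is roughly how a removed stop word leaves a position gap in the analyzed phrase:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordGap {
    // A tiny stand-in for analyzer behavior: a removed stop word leaves a
    // "?" placeholder (a position gap) rather than vanishing entirely.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    static List<String> analyzePhrase(String phrase) {
        List<String> tokens = new ArrayList<>();
        for (String tok : phrase.toLowerCase().split("\\s+")) {
            tokens.add(STOP_WORDS.contains(tok) ? "?" : tok);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Mirrors the parsed query: Rubro:"? contrario sensu"
        System.out.println(analyzePhrase("a contrario sensu")); // [?, contrario, sensu]
    }
}
```

The mismatch arises because the index side removed "a" entirely, so no document position can fill that gap, which is why the phrase with the leading stop word finds nothing.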
The StandardAnalyzer will exclude a set of stop words including "a" (see the end of https://github.com/apache/lucenenet/blob/3.0.3-2/src/core/Analysis/StopAnalyzer.cs for a full list)
It is important that the analysis style when querying is compatible with the style used when indexing. This is why your PhraseQuery only works without the "a" because the indexing step removed it.
You can use the StandardAnalyzer constructor that takes an ISet<string> stopWords argument and pass in an empty new HashSet<string>(). Something like:
new StandardAnalyzer(Version.LUCENE_30, new HashSet<string>())
This means that all words will be included in the token stream for the field.
Use this analyzer when indexing and querying and you will get better results.
However, you should note that the StandardAnalyzer also fiddles with the words somewhat. It is designed to be "a good tokenizer for most European-language documents". See the comments at the beginning of https://github.com/apache/lucenenet/blob/3.0.3-2/src/core/Analysis/Standard/StandardTokenizer.cs for some more info and check if it's compatible with your use case.
It may be worth your time to investigate different analyzers for the kind of text you are indexing.
I am annotating and analyzing a series of text files.
The pipeline.annotate method becomes increasingly slow each time it reads a file. Eventually, I get an OutOfMemoryError.
Pipeline is initialized ONCE:
protected void initializeNlp()
{
    Log.getLogger().debug("Starting Stanford NLP");

    // creates a StanfordCoreNLP object, with POS tagging, lemmatization,
    // NER, parsing, and more
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, depparse, natlog, openie");
    props.put("regexner.mapping", namedEntityPropertiesPath);

    pipeline = new StanfordCoreNLP(props);

    Log.getLogger().debug("\n\n\nStarted Stanford NLP Successfully\n\n\n");
}
I then process each file using the same instance of the pipeline (as recommended elsewhere on SO and by Stanford).
public void processFile(Path file)
{
    try
    {
        Instant start = Instant.now();

        // cleanString holds the file's contents after cleanup (prepared elsewhere)
        Annotation document = new Annotation(cleanString);
        Log.getLogger().info("ANNOTATE");
        pipeline.annotate(document);
        Long millis = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Annotation Duration in millis: " + millis);

        AnalyzedFile af = AnalyzedFileFactory.getAnalyzedFile(AnalyzedFileFactory.GENERIC_JOB_POST, file);
        processSentences(af, document);

        Log.getLogger().info("\n\n\nFile Processing Complete\n\n\n\n\n");
        Long millis1 = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Total Duration in millis: " + millis1);

        allFiles.put(file.toUri().toString(), af);
    }
    catch (Exception e)
    {
        Log.getLogger().debug(e.getMessage(), e);
    }
}
To be clear, I expect the problem is with my configuration. However, I am certain that the stall and memory issues occur at the pipeline.annotate call.
I dispose of all references to Stanford-NLP objects other than pipeline (e.g., CoreLabel) after processing each file. That is, I do not keep references to any Stanford objects in my code beyond the method level.
Any tips or guidance would be deeply appreciated.
OK, the last paragraph of the question made me go and double-check. The answer is that I WAS keeping a reference to a CoreMap in one of my own classes. In other words, I was keeping in memory all the Trees, Tokens, and other analyses for every sentence in my corpus.
In short: keep StanfordNLP CoreMaps only for a given batch of sentences, then dispose of them.
(I expect a hard-core computational linguist would say there is rarely any need to keep a CoreMap once it has been analyzed, but I have to declare my neophyte status here.)
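As a sketch of that extract-then-release pattern (plain Java, no CoreNLP dependency; the Analysis class is a made-up stand-in for a heavyweight CoreMap), keeping only the small strings you need lets the bulky per-file objects be garbage-collected:

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractThenRelease {
    // Stand-in for a heavyweight per-file analysis result: a CoreMap holds
    // trees, tokens, and graphs; here the bulk is simulated with a byte[].
    static class Analysis {
        final byte[] bulk = new byte[1 << 20]; // ~1 MB of parse machinery
        final String summary;
        Analysis(String summary) { this.summary = summary; }
    }

    public static void main(String[] args) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            Analysis a = new Analysis("file-" + i);
            kept.add(a.summary); // keep only the small extract...
            // ...and let 'a' (with its ~1 MB bulk) become unreachable here,
            // instead of accumulating every Analysis in a long-lived list.
        }
        System.out.println(kept.size()); // 50
    }
}
```

Storing the Analysis objects themselves in `kept` would reproduce the symptom in the question: steadily growing memory and, eventually, an OutOfMemoryError.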
The date detected by my following program gets split into two separate mentions, whereas the detected date in the NER output of the CoreNLP demo is a single mention, as it should be. What should I edit in my program to correct this?
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, entitymentions");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String text = "This software was released on Februrary 5, 2015.";
Annotation document = new Annotation(text);
pipeline.annotate(document);

List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    List<CoreMap> mentions = sentence.get(MentionsAnnotation.class);
    if (mentions != null) {
        for (CoreMap mention : mentions) {
            System.out.println("== Token=" + mention.get(TextAnnotation.class));
            System.out.println("NER=" + mention.get(NamedEntityTagAnnotation.class));
            System.out.println("Normalized NER=" + mention.get(NormalizedNamedEntityTagAnnotation.class));
        }
    }
}
Output from this program:
== Token=Februrary 5,
NER=DATE
Normalized NER=****0205
== Token=2015
NER=DATE
Normalized NER=2015
Output from the CoreNLP online demo (screenshot not reproduced here): the date is shown as a single DATE.
Note that the online demo is showing any sequence of consecutive tokens with the same NER tag as belonging to the same unit. Consider this sentence:
The event happened on February 5th January 9th.
This example yields "February 5th January 9th" as a single DATE in the online demo.
Yet it recognizes "February 5th" and "January 9th" as separate entity mentions.
Your sample code is looking at mentions, not NER chunks. Mentions are not being shown by the online demo.
That being said, I am not sure why SUTime is not joining February 5th and 2015 together in your example. Thanks for bringing this up, I will look into improving the module to fix this issue in future releases.
Does anyone know how to (using SimpleNLG) create a proper "two part" sentence like the one below? (I'm not a linguist, so I'm not exactly sure what syntactic category each word/phrase belongs to.)
"I bought a new widget engine, which created product A, product B, and product C."
The text in bold will be inserted dynamically at runtime by a syntactic parser or something else. I went through the SimpleNLG tutorial (there doesn't seem to be anything else that's more in-depth) and subsequently tried to attach a PPPhraseSpec object (representing the second part of the sentence above) to a SPhraseSpec (which has a nounphrase and verbphrase), but the result is incomprehensible and is grammatically incorrect.
Here is a solution to your problem: the first part in bold is a grammatical object (a noun phrase), and the second part in bold is a coordinated object of the verb "create", attached as a subordinate clause.
import simplenlg.realiser.english.Realiser;
import simplenlg.lexicon.Lexicon;
import simplenlg.phrasespec.*;
import simplenlg.framework.*;
import simplenlg.features.*;

public class Test {
    // Target:
    //   I bought a new widget engine,
    //   which created product A, product B, and product C.
    public static void main(String[] args) {
        System.out.println("Starting...");

        Lexicon lexicon = Lexicon.getDefaultLexicon();
        NLGFactory nlgFactory = new NLGFactory(lexicon);
        Realiser realiser = new Realiser(lexicon);

        SPhraseSpec s1 = nlgFactory.createClause("I", "bought", "a new widget engine");
        s1.setFeature(Feature.TENSE, Tense.PAST);

        SPhraseSpec s2 = nlgFactory.createClause("", "created");
        NPPhraseSpec object1 = nlgFactory.createNounPhrase("product A");
        NPPhraseSpec object2 = nlgFactory.createNounPhrase("product B");
        NPPhraseSpec object3 = nlgFactory.createNounPhrase("product C");

        CoordinatedPhraseElement cc = nlgFactory.createCoordinatedPhrase();
        cc.addCoordinate(object1);
        cc.addCoordinate(object2);
        cc.addCoordinate(object3);
        s2.setObject(cc);
        s2.setFeature(Feature.TENSE, Tense.PAST);
        s2.setFeature(Feature.COMPLEMENTISER, ", which"); // non-restrictive?

        s1.addComplement(s2);

        String output = realiser.realiseSentence(s1);
        System.out.println(output);
    }
}
I am not a native English speaker and often get this part wrong, but I think you may actually want a restrictive relative clause instead of a non-restrictive one: "I bought a new widget engine that created..." instead of "I bought a new widget engine, which created...". If that is the case, just comment out the line setting the complementiser.