How to suppress unmatched words in Stanford NER classifiers?

I am new to Stanford NLP and NER, and I am trying to train a custom classifier with datasets of currencies and countries.
My training data in training-data-currency.tsv looks like this:
USD CURRENCY
GBP CURRENCY
And the training data in training-data-countries.tsv looks like this:
USA COUNTRY
UK COUNTRY
And the classifier properties look like this:
trainFileList = classifiers/training-data-currency.tsv,classifiers/training-data-countries.tsv
ner.model=classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz
serializeTo = classifiers/my-classification-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
The Java code to find the categories is:
LinkedHashMap<String, LinkedHashSet<String>> map = new LinkedHashMap<String, LinkedHashSet<String>>();
NERClassifierCombiner classifier = null;
try {
    classifier = new NERClassifierCombiner(true, true,
            "C:\\Users\\perso\\Downloads\\stanford-ner-2015-04-20\\stanford-ner-2015-04-20\\classifiers\\my-classification-model.ser.gz");
} catch (IOException e) {
    e.printStackTrace();
}
List<List<CoreLabel>> classify = classifier.classify("Zambia");
for (List<CoreLabel> coreLabels : classify) {
    for (CoreLabel coreLabel : coreLabels) {
        String word = coreLabel.word();
        String category = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);
        if (!"O".equals(category)) {
            if (map.containsKey(category)) {
                map.get(category).add(word);
            } else {
                LinkedHashSet<String> temp = new LinkedHashSet<String>();
                temp.add(word);
                map.put(category, temp);
            }
            System.out.println(word + ":" + category);
        }
    }
}
When I run the above code with input "USD" or "UK", I get the expected result, "CURRENCY" or "COUNTRY". But when I input something like "Russia", the return value is "CURRENCY", which comes from the first training file in the properties. I expected 'O' to be returned for values that are not present in my training data.
How can I achieve this behavior? Any pointers on where I am going wrong would be really helpful.

Hi, I'll try to help out!
So it sounds to me like you have a list of strings that should be called "CURRENCY", and you have a list of strings that should be called "COUNTRY", etc...
And you want something to tag strings based off of your list. So when you see "RUSSIA", you want it to be tagged "COUNTRY", when you see "USD", you want it to be tagged "CURRENCY".
I think these tools will be more helpful for you (particularly the first one):
http://nlp.stanford.edu/software/regexner/
http://nlp.stanford.edu/software/tokensregex.shtml
The NERClassifierCombiner is designed to train on large volumes of tagged sentences and look at a variety of features including the capitalization and the surrounding words to make a guess about a given word's NER label.
But it sounds to me like, in your case, you just want to explicitly tag certain sequences based on your pre-defined lists, so I would explore the links above.
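For example, here is a minimal sketch of the RegexNER route. The file name currencies.tab and its entries are hypothetical; each mapping line is a phrase, a tab, and the tag to assign, with an optional third tab-separated column listing NER types the rule may overwrite (useful because the ner annotator may already tag "Russia" as LOCATION):
USD	CURRENCY
GBP	CURRENCY
Russia	COUNTRY	LOCATION
And the pipeline:
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class RegexNerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // regexner runs after ner and applies the dictionary on top of it
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
        props.put("regexner.mapping", "currencies.tab"); // hypothetical mapping file
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = new Annotation("Russia sells oil for USD.");
        pipeline.annotate(annotation);
        for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
            // tokens that match no mapping line keep the ner tag, which is "O" for unknown words
            System.out.println(token.word() + "\t"
                    + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
        }
    }
}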
Please let me know if you need any more help and I will be happy to follow up!

Related

Find all references to a supplied noun in StanfordNLP

I'm trying to parse some text to find all references to a particular item. So, for example, if my item was The Bridge on the River Kwai and I passed it this text, I'd like it to find all the instances that refer to the film.
The Bridge on the River Kwai is a 1957 British-American epic war film
directed by David Lean and starring William Holden, Jack Hawkins, Alec
Guinness, and Sessue Hayakawa. The film is a work of fiction, but
borrows the construction of the Burma Railway in 1942–1943 for its
historical setting. The movie was filmed in Ceylon (now Sri Lanka).
The bridge in the film was near Kitulgala.
So far my attempt has been to go through all the mentions attached to each CorefChain and loop through those hunting for my target string. If I find the target string, I add the whole CorefChain, as I think this means the other items in that CorefChain also refer to the same thing.
List<CorefChain> gotRefs = new ArrayList<CorefChain>();
String pQuery = "The Bridge on the River Kwai";
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
    List<CorefChain.CorefMention> corefMentions = cc.getMentionsInTextualOrder();
    boolean addedChain = false;
    for (CorefChain.CorefMention cm : corefMentions) {
        if ((!addedChain) && (pQuery.equals(cm.mentionSpan))) {
            gotRefs.add(cc);
            addedChain = true;
        }
    }
}
I then loop through this second list of CorefChains, re-retrieve the mentions for each chain, and step through them. In that loop I print which sentences likely mention my item.
for (CorefChain gr : gotRefs) {
    List<CorefChain.CorefMention> corefMentionsUsing = gr.getMentionsInTextualOrder();
    for (CorefChain.CorefMention cm : corefMentionsUsing) {
        System.out.println("Got reference to " + cm.mentionSpan + " in sentence #" + cm.sentNum);
    }
}
It finds some of my references, but not that many, and it produces a lot of false positives. As might be entirely apparent from reading this, I don't really know the first thing about NLP - am I going about this entirely the wrong way? Is there a StanfordNLP parser that will already do some of what I'm after? Should I be training a model in some way?
I think a problem with your example is that you are looking for references to a movie title, and there isn't support in Stanford CoreNLP for recognizing movie titles, book titles, etc...
If you look at this example:
"Joe bought a laptop. He is happy with it."
You will notice that it connects:
"Joe" -> "He"
and
"a laptop" -> "it"
Coreference is an active research area and even the best system can only really be expected to produce an F1 of around 60.0 on general text, meaning it will often make errors.

Train model using Named entity

I am looking at Stanford CoreNLP, using the Named Entity Recognizer. I have different kinds of input text and I need to tag them with my own entities, so I started training my own model, but it doesn't seem to be working.
For example, my input text string is "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q"
I went through the examples for training my own models, looking only for the words I am interested in.
My jane-austen-emma-ch1.tsv looks like this
Toyota PERS
Land Cruiser PERS
From the above input text, I am only interested in those two terms: one is "Toyota" and the other is "Land Cruiser".
The austen.prop looks like this:
trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
Run the following command to generate the ner-model.ser.gz file
java -cp stanford-corenlp-3.4.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
public static void main(String[] args) {
    String serializedClassifier = "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz";
    String serializedClassifier2 = "C:/standford-ner/ner-model.ser.gz";
    try {
        NERClassifierCombiner classifier = new NERClassifierCombiner(false, false,
                serializedClassifier2, serializedClassifier);
        String ss = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q";
        System.out.println("---");
        List<List<CoreLabel>> out = classifier.classify(ss);
        for (List<CoreLabel> sentence : out) {
            for (CoreLabel word : sentence) {
                System.out.print(word.word() + '/' + word.get(AnswerAnnotation.class) + ' ');
            }
            System.out.println();
        }
    } catch (ClassCastException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Here is the output I am getting
Book/PERS of/PERS 49/O Magazine/PERS Articles/PERS on/O Toyota/PERS Land/PERS Cruiser/PERS 1956-1987/PERS Gold/O Portfolio/PERS http://t.co/EqxmY1VmLg/PERS http://t.co/F0Vefuoj9Q/PERS
which I think is wrong. I am looking for Toyota/PERS and Land Cruiser/PERS (which is a multi-word field).
Thanks for the help. Any help is really appreciated.
I believe you should also put examples of O (non-entity) tokens in your trainFile. As you gave it, the trainFile is just too simple for any learning to be done; it needs both O and PERS examples so it doesn't annotate everything as PERS. You're not teaching it about your not-of-interest tokens. Say, like this:
Toyota PERS
of O
Portfolio O
49 O
and so on.
Also, for phrase-level recognition you should look into RegexNER, where you can match patterns rather than single words. I'm working on this through the API and I have the following code:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.put("regexner.mapping", customLocationFilename);
with the following contents for customLocationFilename:
Make Believe Town figure of speech ORGANIZATION
( /Hello/ [{ ner:PERSON }]+ ) salut PERSON
Bachelor of (Arts|Laws|Science|Engineering) DEGREE
( /University/ /of/ [{ ner:LOCATION }] ) SCHOOL
and this input text:
Hello Mary Keller was born on 4th of July and took a Bachelor of Science. Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to University of London from the Make Believe Town depot. INV2345 is for the balance. Customer contact (Sigourney Weaver) says they will pay this on the usual credit terms (30 days).
The output I get
Hello Mary Keller is a salut
4th of July is a DATE
Bachelor of Science is a DEGREE
$ 100,000 is a MONEY
40 % is a PERCENT
15th August is a DATE
University of London is a ORGANIZATION
Make Believe Town is a figure of speech
Sigourney Weaver is a PERSON
30 days is a DURATION
For more info on how to do this you can look at the example that got me going.
The NERClassifier* is word level, that is, it labels words, not phrases. Given that, the classifier seems to be performing fine. If you want, you can hyphenate words that form phrases. So in your labeled examples and in your test examples, you would make "Land Cruiser" to "Land_Cruiser".
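For example (an untested sketch of the idea), the training file would become:
Toyota PERS
Land_Cruiser PERS
of O
and you would apply the same joining to the text before classifying, so the word-level classifier sees one token:
// hypothetical preprocessing step; ss is the input string from the question's code
String joined = ss.replace("Land Cruiser", "Land_Cruiser");
List<List<CoreLabel>> out = classifier.classify(joined);
The substitution has to be applied consistently at training and test time, otherwise the classifier never sees the joined token.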

Extract Noun phrase using stanford NLP

I am trying to find the theme/noun phrase in a sentence using Stanford NLP.
For example, for the sentence "the white tiger" I would love to get the theme/noun phrase "white tiger".
For this I used the POS tagger. My sample code is below.
The result I am getting is "tiger", which is not correct.
public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation("the white tiger");
    pipeline.annotate(annotation);
    List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
    System.out.println("the size of the sentence is......" + sentences.size());
    for (CoreMap sentence : sentences) {
        System.out.println("the sentence is..." + sentence.toString());
        Tree tree = sentence.get(TreeAnnotation.class);
        PrintWriter out = new PrintWriter(System.out);
        out.println("The first sentence parsed is:");
        tree.pennPrint(out);
        TregexPattern pattern = TregexPattern.compile("@NP");
        TregexMatcher matcher = pattern.matcher(tree);
        while (matcher.find()) {
            Tree match = matcher.getMatch();
            List<Tree> leaves1 = match.getChildrenAsList();
            StringBuilder stringbuilder = new StringBuilder();
            for (Tree tree1 : leaves1) {
                String val = tree1.label().value();
                if (val.equals("NN") || val.equals("NNS")
                        || val.equals("NNP") || val.equals("NNPS")) {
                    Tree nn[] = tree1.children();
                    String ss = Sentence.listToString(nn[0].yield());
                    stringbuilder.append(ss).append(" ");
                }
            }
            System.out.println("the final stringbuilder is ...." + stringbuilder);
        }
    }
}
Any help is really appreciated, as are any other ideas for getting this achieved.
It looks like you're descending the parse trees looking for NN.*.
"white" is a JJ (an adjective), which won't be included when searching for NN.*.
You should take a close look at the Stanford Dependencies Manual and decide what part of speech tags encompass what you're looking for. You should also look at real linguistic data to try to figure out what matters in the task you're trying to complete. What about:
the tiger [with the black one] [who was white]
Simply traversing the tree in that case will give you tiger black white. Exclude PP's? Then you lose lots of good info:
the tiger [with white fur]
I'm not sure what you're trying to accomplish, but make sure what you're trying to do is restricted in the right way.
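If what you want is the whole NP with its modifiers, one option is to print the full yield of each matched NP instead of filtering its children by NN tags. A sketch against the question's code, where tree is the TreeAnnotation of the sentence:
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.tregex.TregexMatcher;
import edu.stanford.nlp.trees.tregex.TregexPattern;

// prints every NP in the tree in full, so the JJ "white" stays with "tiger"
static void printNounPhrases(Tree tree) {
    TregexPattern pattern = TregexPattern.compile("@NP");
    TregexMatcher matcher = pattern.matcher(tree);
    while (matcher.find()) {
        Tree np = matcher.getMatch();
        // yield() returns all leaves under the NP, determiners included
        System.out.println(Sentence.listToString(np.yield()));
    }
}
Note this prints "the white tiger", determiner and all; if you want to drop "the", filter out what you don't want (DT) rather than whitelisting NN tags.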
You ought to brush up on your basic syntax as well. "the white tiger" is what linguists call a noun phrase, or NP. You'd be hard pressed to find a linguist who would call an NP a sentence. There are also often many NPs inside a sentence; sometimes they're even embedded inside one another. The Stanford Dependencies Manual is a good start. As the name suggests, the Stanford Dependencies are based on the idea of dependency grammar, though there are other approaches that bring different insights to the table.
Learning what linguists know about the structure of sentences could help you significantly in getting at what you're trying to extract, or, as often happens, in realizing that what you're trying to extract is too difficult and that you need to find a new route to a solution.

Mallet topic model - inconsistent results with serialized file

I train a topic model with Mallet and want to serialize it for later use. I ran the trained model on two test documents, then deserialized the saved model and ran it on the same documents, and the results were completely different.
Is there anything wrong with the way I'm saving/loading the documents (code attached)?
Thanks!
List<Pipe> pipeList = initPipeList();
// Begin by importing documents from text to feature sequences
InstanceList instances = new InstanceList(new SerialPipes(pipeList));
for (String document : documents) {
    Instance inst = new Instance(document, "", "", "");
    instances.addThruPipe(inst);
}
ParallelTopicModel model = new ParallelTopicModel(numTopics, alpha_t * numTopics, beta_w);
model.addInstances(instances);
model.setNumThreads(numThreads);
model.setNumIterations(numIterations);
model.estimate();
printProbabilities(model, "doc 1"); // I replaced the contents of the docs due to copyright issues
printProbabilities(model, "doc 2");
model.write(new File("model.bin"));
model = ParallelTopicModel.read(new File("model.bin"));
printProbabilities(model, "doc 1");
printProbabilities(model, "doc 2");
Definition of printProbabilities():
public void printProbabilities(ParallelTopicModel model, String doc) {
    List<Pipe> pipeList = initPipeList();
    InstanceList instances = new InstanceList(new SerialPipes(pipeList));
    instances.addThruPipe(new Instance(doc, "", "", ""));
    double[] probabilities = model.getInferencer().getSampledDistribution(instances.get(0), 10, 1, 5);
    for (int i = 0; i < probabilities.length; i++) {
        double probability = probabilities[i];
        if (probability > 0.01) {
            System.out.println("Topic " + i + ", probability: " + probability);
        }
    }
}
You have to use the same pipe for training and for classification. During training, the pipe's data alphabet gets updated with each training instance. You don't produce the same pipe with new SerialPipes(pipeList), as its data alphabet is empty. Save/load the pipe, or the instance list containing the pipe, along with the model, and use that pipe to add test instances.
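A sketch of that fix against the code above (the file name instances.bin is illustrative):
// after training: the InstanceList is Serializable and carries the pipe and its alphabets
instances.save(new File("instances.bin"));
model.write(new File("model.bin"));

// later, possibly in a fresh process: reuse the trained pipe instead of building a new SerialPipes
InstanceList training = InstanceList.load(new File("instances.bin"));
ParallelTopicModel loaded = ParallelTopicModel.read(new File("model.bin"));
InstanceList test = new InstanceList(training.getPipe());
test.addThruPipe(new Instance(doc, "", "", ""));
double[] probabilities = loaded.getInferencer().getSampledDistribution(test.get(0), 10, 1, 5);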
When you don't fix a random seed, every run of Mallet gives you a different topic model (with the numbers of the topics permuted, some topics slightly different, other topics very different).
Fix the random seed to get replicable topics.
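If I remember correctly, ParallelTopicModel exposes a seed setter (TopicInferencer has one too, for the inference-time sampling); a one-line change to the question's code:
ParallelTopicModel model = new ParallelTopicModel(numTopics, alpha_t * numTopics, beta_w);
model.setRandomSeed(42); // any fixed value; makes the Gibbs sampling repeatable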

Lucene wild card search

How can I perform a wildcard search in Lucene?
I have the text: "1997_titanic"
If I search for "1997_titanic", it returns a result, but I am not able to do the two searches below:
1) If I search with only "1997", it does not return any results.
2) If there is a space, as in "spider man", it does not find any results.
I retrieve all movie information from a DB and store it in Lucene Documents:
public Document createMovieDoc(Movie m) {
    Document document = new Document();
    document.add(new StoredField("moviename", m.getName()));
    TextField field = new TextField("movienameSearch", m.getName().toLowerCase(), Store.NO);
    field.setBoost(5.0f);
    document.add(field);
    return document;
}
And to search, I have this method:
public List search(String txt) {
    PhraseQuery phQuery = new PhraseQuery();
    Term term = new Term("movienameSearch", txt.toLowerCase());
    BooleanQuery b = new BooleanQuery();
    b.add(phQuery, Occur.SHOULD);
    TopFieldDocs tp = searcher.search(b, 20, ..);
    for (int i = 0; i < tp.length; i++) {
        int mId = tp[i].doc;
        Document d = searcher.doc(mId);
        String moviename = d.get("moviename");
        list.add(moviename);
    }
    return list;
}
I'm not sure what analyzer you are using to index. Sounds like maybe WhitespaceAnalyzer? It sounds like, when indexing, "1997_titanic" remains a single token, while "spider man" is split into the tokens "spider" and "man".
It could also be SimpleAnalyzer, which uses a LetterTokenizer. That would make it impossible to search for "1997", since that tokenizer eliminates all numbers from the indexed representation of the text.
Your search method doesn't look right. You aren't adding any terms to your PhraseQuery, so I wouldn't expect it to find anything. You must add some terms in order for anything to be found. You create a Term in what you've provided, but nothing is ever done with it. Maybe this has something to do with how you've picked your excerpts? Not sure, I'm a bit confused by that.
In order to manually construct a PhraseQuery you must add each term individually, so to search for "spider man", you would do something like:
PhraseQuery phQuery= new PhraseQuery();
phQuery.add(new Term("movienameSearch", "spider"));
phQuery.add(new Term("movienameSearch", "man"));
This requires you to know what the analyzer was doing at index time, and tokenize the input yourself to suit. The simpler solution is to just use the QueryParser:
//With whatever analyzer you like to use.
QueryParser parser = new QueryParser(Version.LUCENE_46, "defaultField", analyzer);
Query query = parser.parse("movienameSearch:\"" + txt.toLowerCase() + "\"");
TopFieldDocs tp= searcher.search(query, 20);
This allows you to rely on the same analyzer to index and query, so you don't have to know how to tokenize your phrases to suit.
As far as finding "1997" and "titanic" individually, I would recommend just using StandardAnalyzer. It will tokenize those into discrete tokens, allowing them to be searched very easily, with a simple query like: movienameSearch:1997.
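An easy way to verify any of this is to print the tokens your analyzer actually produces. A sketch, assuming the same Lucene 4.6 as in the QueryParser example above:
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerCheck {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        TokenStream ts = analyzer.tokenStream("movienameSearch",
                new StringReader("1997_titanic spider man"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // one line per indexed token
        }
        ts.end();
        ts.close();
    }
}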
