I am looking on standford corenlp using the Named Entity REcognizer.I have different kinds of input text and i need to tag it into my own Entity.So i started training my own model and it doesnt seems to be working.
For eg: my input text string is "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q"
I go through the examples to train my own models and and look for only some words that I am interested in.
My jane-austen-emma-ch1.tsv looks like this
Toyota PERS
Land Cruiser PERS
From the above input text i am only interested in those two words. The one is
Toyota and the other word is Land Cruiser.
The austin.prop look like this
trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
Run the following command to generate the ner-model.ser.gz file
java -cp stanford-corenlp-3.4.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
public static void main(String[] args) {
String serializedClassifier = "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz";
String serializedClassifier2 = "C:/standford-ner/ner-model.ser.gz";
try {
NERClassifierCombiner classifier = new NERClassifierCombiner(false, false,
serializedClassifier2,serializedClassifier);
String ss = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q";
System.out.println("---");
List<List<CoreLabel>> out = classifier.classify(ss);
for (List<CoreLabel> sentence : out) {
for (CoreLabel word : sentence) {
System.out.print(word.word() + '/' + word.get(AnswerAnnotation.class) + ' ');
}
System.out.println();
}
} catch (ClassCastException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Here is the output I am getting
Book/PERS of/PERS 49/O Magazine/PERS Articles/PERS on/O Toyota/PERS Land/PERS Cruiser/PERS 1956-1987/PERS Gold/O Portfolio/PERS http://t.co/EqxmY1VmLg/PERS http://t.co/F0Vefuoj9Q/PERS
which i think its wrong.I am looking for Toyota/PERS and Land Cruiser/PERS(Which is a multi valued fied.
Thanks for the Help.Any help is really appreciated.
I believe you should also put examples of 0 entities in your trainFile. As you gave it, the trainFile is just too simple for the learning to be done, it needs both 0 and PERSON examples so it doesn't annotate everything as PERSON. You're not teaching it about your not-of-interest entities. Say, like this:
Toyota PERS
of 0
Portfolio 0
49 0
and so on.
Also, for phrase-level recognition you should look into regexner, where you can have patterns (patterns are good for us). I'm working on this with the API and I have the following code:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.put("regexner.mapping", customLocationFilename);
with the following customLocationFileName:
Make Believe Town figure of speech ORGANIZATION
( /Hello/ [{ ner:PERSON }]+ ) salut PERSON
Bachelor of (Arts|Laws|Science|Engineering) DEGREE
( /University/ /of/ [{ ner:LOCATION }] ) SCHOOL
and text: Hello Mary Keller was born on 4th of July and took a Bachelor of Science. Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to University of London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney Weaver) says they will pay this on the usual credit terms (30 days).
The output I get
Hello Mary Keller is a salut
4th of July is a DATE
Bachelor of Science is a DEGREE
$ 100,000 is a MONEY
40 % is a PERCENT
15th August is a DATE
University of London is a ORGANIZATION
Make Believe Town is a figure of speech
Sigourney Weaver is a PERSON
30 days is a DURATION
For more info on how to do this you can look at the example that got me going.
The NERClassifier* is word level, that is, it labels words, not phrases. Given that, the classifier seems to be performing fine. If you want, you can hyphenate words that form phrases. So in your labeled examples and in your test examples, you would make "Land Cruiser" to "Land_Cruiser".
Related
I am writing a Bixby capsule and one of the inputs is street address.
One method that I have tried is creating the following structure:
structure (FullAddress) {
description (Address of a house)
property (addressNumber) {
type (geo.StreetNumber)
min (Required)
description (Address Number)
}
property (addressStreet) {
type (geo.StreetName)
min (Required)
description (Street Name)
}
property (addressSuffix) {
type (geo.StreetSuffix)
min (Required)
description (Street Name)
}
}
with a constructor action to put the 3 inputs together.
I have seen that given an address 19 Fake Fields Street the geo.StreetName typed input sometimes is able to understand Fake Fields and sometimes just Fake and drops Fields.
Also Bixby's speech to text sometimes hears app or have instead of ave for the geo.StreetSuffix value which makes it prompt the user for a suffix.
Is there a way to get Bixby to understand a street address with a little more accuracy?
Basically you need more training examples, which include 2 or 3 words as street names. Try to have at least 3 examples with xxx fakexxx fields street, and test in simulator the utterance yyy fakeyyy fields street to see if Bixby can capture fields as part of the address name. The goal here is to train Bixby to learn that there might be 2 or even 3 words ahead of addressSuffix. After that you can try utterance zzz fakezzz creek street without ever using creek in the training to confirm Bixby not just learned fields. Please read more in this article.
There is no easy way when come to speech recognition. You can include a vocab model to force "app" to be "ave", but what if user truly want say the word app or have? I would think the user can type ave or blvd, but need to say the word avenue instead of ave, and boulevard instead of blvd.
Another alternative is to use the viv.geo.SearchTerm in training and viv.geo.NamedPoint in your action. This let a user say something incomplete like "1 Market Street, California" and Bixby will use a HERE maps search to find this in San Francisco.
To use, setup a NamedPoint concept (after importing viv.geo)
structure (InputAddress) {
role-of (geo.NamedPoint)
}
Then in your action, you can do something like:
input (namedPoint) {
type (InputAddress)
min (Required) max (One)
default-select {
with-learning
with-rule {
select-first
}
}
}
In this example, using learning and select-first will automatically select the first address. Without this, Bixby will autosuggest addresses.
namedPoint will then be passed to your endpoint and you can parse as needed.
In training, use geo.SearchTerm - for example:
[g:GetAddressAction] My address is {[g:InputAddress] (665 Clyde Ave Mountain View California)[v:geo.SearchTerm]}
or for a prompt, you could use:
[g:GetAddressAction:continue:InputAddress] {[g:InputAddress] (60 S Market)[v:geo.SearchTerm]}
You can get a more fully formatted address by letting Bixby handle it by using the viv.geo.ResolveAddressByPlaceID goal. Here is a complete action using NamedPoint and ResolveAddressByPlaceID. Note the links to the relevant docs in comments
action (GetAddressAction) {
type(Search)
description (Get Address)
collect {
// See https://bixbydevelopers.com/dev/docs/dev-guide/developers/library.geo#using-searchterm - used in training
// and https://bixbydevelopers.com/dev/docs/dev-guide/developers/library.geo#namedpoint - used below and for computed-input
input (namedPoint) {
type (InputAddress)
min (Required) max (One)
default-select {
with-learning
with-rule {
select-first
}
}
// hidden - Hide if all you need is address
}
computed-input (address){
type (geo.Address)
min (Optional) max (One)
compute {
intent {
goal: viv.geo.ResolveAddressByPlaceID
value: $expr(namedPoint.placeID)
}
}
}
}
output (geo.Address)
}
I'm trying to parse some text to find all references to a particular item. So, for example, if my item was The Bridge on the River Kwai and I passed it this text, I'd like it to find all the instances I've put in bold.
The Bridge on the River Kwai is a 1957 British-American epic war film
directed by David Lean and starring William Holden, Jack Hawkins, Alec
Guinness, and Sessue Hayakawa. The film is a work of fiction, but
borrows the construction of the Burma Railway in 1942–1943 for its
historical setting. The movie was filmed in Ceylon (now Sri Lanka).
The bridge in the film was near Kitulgala.
So far my attempt has been to go through all the mentions attached to each CorefChain and loop through those hunting for my target string. If I find the target string, I add the whole CorefChain, as I think this means the other items in that CorefChain also refer to the same thing.
List<CorefChain> gotRefs = new ArrayList<CorefChain>();
String pQuery = "The Bridge on the River Kwai";
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
List<CorefChain.CorefMention> corefMentions = cc.getMentionsInTextualOrder();
boolean addedChain = false;
for (CorefChain.CorefMention cm : corefMentions) {
if ((!addedChain) &&
(pQuery.equals(cm.mentionSpan))) {
gotRefs.add(cc);
addedChain = true;
}
}
}
I then loop through this second list of CorefChains, re-retrieve the mentions for each chain and step through them. In that loop I show which sentences have a likely mention of my item in a sentence.
for (CorefChain gr : gotRefs) {
List<CorefChain.CorefMention> corefMentionsUsing = gr.getMentionsInTextualOrder();
for (CorefChain.CorefMention cm : corefMentionsUsing) {
//System.out.println("Got reference to " + cm.mentionSpan + " in sentence #" + cm.sentNum);
}
}
It finds some of my references, but not that many, and it produces a lot of false positives. As might be entirely apparently from reading this, I don't really know the first thing about NLP - am I going about this entirely the wrong way? Is there a StanfordNLP parser that will already do some of what I'm after? Should I be training a model in some way?
I think a problem with your example is that you are looking for references to a movie title, and there isn't support in Stanford CoreNLP for recognizing movie titles, book titles, etc...
If you look at this example:
"Joe bought a laptop. He is happy with it."
You will notice that it connects:
"Joe" -> "He"
and
"a laptop" -> "it"
Coreference is an active research area and even the best system can only really be expected to produce an F1 of around 60.0 on general text, meaning it will often make errors.
The date detected from my following program gets split into two separate mentions whereas the detected date in the NER output of CoreNLP demo is single as it should be. What should I edit in my program to correct this.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, entitymentions");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String text = "This software was released on Februrary 5, 2015.";
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for(CoreMap sentence: sentences) {
List<CoreMap> mentions = sentence.get(MentionsAnnotation.class);
if (mentions != null) {
for (CoreMap mention : mentions) {
System.out.println("== Token=" + mention.get(TextAnnotation.class));
System.out.println("NER=" + mention.get(NamedEntityTagAnnotation.class));
System.out.println("Normalized NER=" + mention.get(NormalizedNamedEntityTagAnnotation.class));
}
}
}
Output from this program:
== Token=Februrary 5,
NER=DATE
Normalized NER=****0205
== Token=2015
NER=DATE
Normalized NER=2015
Output from CoreNLP online demo:
Note that the online demo is showing any sequence of consecutive tokens with the same NER tag as belonging to the same unit. Consider this sentence:
The event happened on February 5th January 9th.
This example yields "February 5th January 9th" as a single DATE in the online demo.
Yet it recognizes "February 5th" and "January 9th" as separate entity mentions.
Your sample code is looking at mentions, not NER chunks. Mentions are not being shown by the online demo.
That being said, I am not sure why SUTime is not joining February 5th and 2015 together in your example. Thanks for bringing this up, I will look into improving the module to fix this issue in future releases.
I am new to Stanford NLP and NER and trying to train a custom classifier with a data sets of currencies and countries.
My training data in training-data-currency.tsv looks like -
USD CURRENCY
GBP CURRENCY
And, training data in training-data-countries.tsv looks like -
USA COUNTRY
UK COUNTRY
And, classifiers properties look like -
trainFileList = classifiers/training-data-currency.tsv,classifiers/training-data-countries.tsv
ner.model=classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz
serializeTo = classifiers/my-classification-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
Java code to find the categories is -
LinkedHashMap<String, LinkedHashSet<String>> map = new<String, LinkedHashSet<String>> LinkedHashMap();
NERClassifierCombiner classifier = null;
try {
classifier = new NERClassifierCombiner(true, true,
"C:\\Users\\perso\\Downloads\\stanford-ner-2015-04-20\\stanford-ner-2015-04-20\\classifiers\\my-classification-model.ser.gz"
);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
List<List<CoreLabel>> classify = classifier.classify("Zambia");
for (List<CoreLabel> coreLabels : classify) {
for (CoreLabel coreLabel : coreLabels) {
String word = coreLabel.word();
String category = coreLabel
.get(CoreAnnotations.AnswerAnnotation.class);
if (!"O".equals(category)) {
if (map.containsKey(category)) {
map.get(category).add(word);
} else {
LinkedHashSet<String> temp = new LinkedHashSet<String>();
temp.add(word);
map.put(category, temp);
}
System.out.println(word + ":" + category);
}
}
}
When I run the above code with input as "USD" or "UK", I get expected result as "CURRENCY" or "COUNTRY". But, when I input something like "Russia", return value is "CURRENCY" which is from the first train file in the properties. I am expecting 'O' would be returned for these values which is not present in my training dat.
How can I achieve this behavior? Any pointers where I am going wrong would be really helpful.
Hi I'll try to help out!
So it sounds to me like you have a list of strings that should be called "CURRENCY", and you have a list of strings that should be called "COUNTRY", etc...
And you want something to tag strings based off of your list. So when you see "RUSSIA", you want it to be tagged "COUNTRY", when you see "USD", you want it to be tagged "CURRENCY".
I think these tools will be more helpful for you (particularly the first one):
http://nlp.stanford.edu/software/regexner/
http://nlp.stanford.edu/software/tokensregex.shtml
The NERClassifierCombiner is designed to train on large volumes of tagged sentences and look at a variety of features including the capitalization and the surrounding words to make a guess about a given word's NER label.
But it sounds to me in your case you just want to explicitly tag certain sequences based off of your pre-defined list. So I would explore the links I provided above.
Please let me know if you need any more help and I will be happy to follow up!
I have an application that is using the new Bing API from the Azure Datamarketplace. I used to be able to query the Bing API using a simple syntax using OR AND, etc. That doesn't seem to work in the new API.
Old syntax:
"Jacksonville Jaguars" OR "NFL Jaguars" OR "Atlanta Falcons"
That would give me a query with any of those phrases (I am making a rt_Sports query for news).
I am calling HttpEncode on the query first, but am still not getting results. It works if I remove all the " marks, but then I am getting results sometimes for news about Falcons and Jaguars (the animals)... Not what I wanted.
Anyone have any idea how you can form a query that takes multiple phrases?
I have tried to not use the OR, not use the ', use a ", use the | instead of OR. All of these work against BING the website, just not in the API.
I just tried this via Bing and got 36 Million results:
NFL Football | Seattle Seahawks | New York Giants | Dallas Cowboys | New Orleans Saints | New England Patriots | Jacksonville Jaguars
Same thing in the API returns 0.
I got an email from a friend who I also emailed this question out to and his thought was that I was going about it wrong. That there should be a way to form a LINQ query off the Bing object with multiple where clauses.
But I don't see how that would be possible. You allow a BingSearchContainer and then call the News method on the container. The News method has only a single Query parameter.
var bingContainer = new Bing.BingSearchContainer(new Uri("https://api.datamarket.azure.com/Bing/Search"));
bingContainer.Credentials = new NetworkCredential(App.BingAPIAccountKey, App.BingAPIAccountKey);
string unifiedQuery = "NFL Football | Jacksonville Jaguars | Atlanta Falcons";
var newsQuery = bingContainer.News(unifiedQuery, null, "en-US", "Strict", null, null, null, "rt_Sports", "Relevance");
newsQuery.BeginExecute(BingNewsResultLoadedCallback, newsQuery);
Try changing unifiedQuery to the following:
var unifiedQuery = "'NFL Football' or 'Jacksonville Jaguars' or 'Atlanta Falcons'";
I tried something very similar to your sample code, using this format for the query string, and it worked for me:
var bingUri = new Uri("https://api.datamarket.azure.com/Bing/Search/v1/", UriKind.Absolute);
var bingContainer = new BingSearchContainer(bingUri);
bingContainer.Credentials = new NetworkCredential(BingAPIUserName, BingAPIAccountKey);
var unifiedQuery = "'NFL Football' or 'Jacksonville Jaguars' or 'Atlanta Falcons'";
var newsQuery = bingContainer.News(unifiedQuery, null, "en-US", "Strict", null, null, null, "rt_Sports", "Relevance");
var results = newsQuery.Execute();
foreach (var item in results)
{
Console.WriteLine(item.Title);
}
Here are my results:
Fantasy Football 2012: Ranking the Top 25 RBs
NFL Football No Longer Just a Sunday Game
Ravens Notebook: Ed Reed decided to play in game vs. Falcons since he 'wasn't doing anything else'
PrimeSport Partners with Jacksonville Jaguars to Offer Tickets and Official Fan
Packages for all Home and Away Games in 2012 Season
Jaguars cut former Ravens wide receiver Lee Evans
Falcons left tackle Baker finally feels healthy
Jaguars release veteran WR Lee Evans
NFC West: 2012 NFL Training Camp
Atlanta Falcons 2012 NFL TV Schedule
Jaguars training camp: Veteran WR Lee Evans released
Jaguars score 18 points in second half to beat Giants 32-31
Jacksonville Jaguars put running back Maurice Jones-Drew on reserve/did not report list
Postcard from camp: Falcons
Questions abound as NFL preseason opens in earnest
NFL fantasy football: Ryan Mathews loses value
The format for the unifiedQuery string is basically the OData URI query string format. For a full description of how these query strings work, check out the OData URI conventions documentation at http://www.odata.org/documentation/uri-conventions.