How to use entitymentions annotator in stanford CoreNLP? - nlp

I am trying the newest version of Stanford CoreNLP. When I extract location or organisation names, every single word is tagged separately. So if the entity is "NEW YORK TIMES", it is recorded as three different entities: "NEW", "YORK" and "TIMES". I see that the newest CoreNLP has an "entitymentions" annotator, which I think may solve this problem. However, there are no usage instructions or examples for it. Could anyone give me more info about this new feature?

Take a look at the mentions annotation key. This should be attached to a sentence, and contain a list of CoreMaps corresponding to each mention. So, there should be a CoreMap in there that corresponds to the mention of "New York Times".

I don't think any annotator will annotate NEW YORK TIMES as a single entity unless you train the model on a dataset that contains it.
The Stanford NER and POS taggers are trained on particular datasets, and they annotate entities based on that training. (I saw that the Stanford source files include dictionary lists of people, locations and organizations, which presumably help decide which entities get annotated.)
A trained model can annotate Newyork as an entity; if you want NEW YORK TIMES annotated as one entity, you have to train on data that contains it.
I tested with these annotators: tokenize,ssplit,pos,lemma,ner,parse,dcoref.
Query: New York Times is really nice.
Result : [Text=New CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NNP Lemma=New NamedEntityTag=ORGANIZATION] [Text=York CharacterOffsetBegin=4 CharacterOffsetEnd=8 PartOfSpeech=NNP Lemma=York NamedEntityTag=ORGANIZATION] [Text=Times CharacterOffsetBegin=9 CharacterOffsetEnd=14 PartOfSpeech=NNP Lemma=Times NamedEntityTag=ORGANIZATION] [Text=is CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=VBZ Lemma=be NamedEntityTag=O] [Text=really CharacterOffsetBegin=18 CharacterOffsetEnd=24 PartOfSpeech=RB Lemma=really NamedEntityTag=O] [Text=nice CharacterOffsetBegin=25 CharacterOffsetEnd=29 PartOfSpeech=JJ Lemma=nice NamedEntityTag=O] [Text=. CharacterOffsetBegin=29 CharacterOffsetEnd=30 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Query: Newyork times
Result : [Text=Newyork CharacterOffsetBegin=0 CharacterOffsetEnd=7 PartOfSpeech=NNP Lemma=Newyork NamedEntityTag=LOCATION] [Text=times CharacterOffsetBegin=8 CharacterOffsetEnd=13 PartOfSpeech=NNS Lemma=time NamedEntityTag=O]

Integer entityMentionIndex = coreLabel.get(CoreAnnotations.EntityMentionIndexAnnotation.class);
If you try it with the string "New York Times newspaper is distributed in California", you can see that entityMentionIndex is 0 (zero) for each of the words New, York and Times. Words that share the same index belong to a single entity.
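The grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not CoreNLP code: the (word, index) pairs are hypothetical stand-ins for what you would read from each CoreLabel, only the index 0 for New/York/Times comes from the answer, and None is an assumed marker for tokens that belong to no mention.

```python
from itertools import groupby

# Hypothetical (word, entityMentionIndex) pairs; None means "not part of a mention".
# Only the index 0 for New/York/Times comes from the answer; the rest are assumed.
tokens = [
    ("New", 0), ("York", 0), ("Times", 0),
    ("newspaper", None), ("is", None), ("distributed", None), ("in", None),
    ("California", 1),
]

def group_mentions(tokens):
    """Join consecutive tokens that share the same non-None mention index."""
    mentions = []
    for index, run in groupby(tokens, key=lambda pair: pair[1]):
        if index is not None:
            mentions.append(" ".join(word for word, _ in run))
    return mentions

print(group_mentions(tokens))
```

Under these assumed indices, the three tokens New, York and Times come back as one mention string.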


compare documents using most similar method

I am able to build the model using the built-in lee_background corpus. But when I try to compare using most_similar method, I get an error.
import gensim

lee_train_file = '/opt/conda/lib/python3.6/site-packages/gensim/test/test_data/lee_background.cor'
train_corpus = []
with open(lee_train_file) as f:
    for i, line in enumerate(f):
        # Tag each document with its line number so results can be traced back.
        train_corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i]))
model = gensim.models.doc2vec.Doc2Vec(vector_size=48, min_count=2, epochs=40)
model.build_vocab(train_corpus)  # must run before train() so corpus_count is set
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
dummy text here...
inferred_vector=model.infer_vector(gensim.utils.simple_preprocess(line) )
model.docvecs.most_similar(inferred_vector, topn=3)
I tried this with list(inferred_vector) but still get an error:
TypeError: 'numpy.float32' object is not iterable
I am trying to compare the dummy text with the corpus to find out whether the entry already exists in the data file.
Instead of list(inferred_vector) I need to use [inferred_vector]. This solved my problem. But every time I run this code, I get different similar documents. How is this possible?
The national executive of the strife-torn Democrats last night appointed little-known West Australian senator Brian Greig
as interim leader--a shock move likely to provoke further conflict between the party's senators and its organisation.
In a move to reassert control over the party's seven senators, the national executive last night rejected Aden Ridgeway's
bid to become interim leader, in favour of Senator John, a supporter of deposed leader Natasha Stott Despoja and an outspoken
gay rights activist.
model.docvecs.most_similar([inferred_vector], topn=5)
Sometimes I get this list, but the list keeps changing every time I run the code, even though the model has not changed.
[(151, 0.5980586409568787),
(74, 0.5736572742462158),
(106, 0.5714541077613831),
(249, 0.5695925951004028),
(209, 0.5642371773719788)]
[(249, 0.5727256536483765),
(151, 0.5725511312484741),
(74, 0.5711895823478699),
(106, 0.5583171248435974),
(292, 0.5491517782211304)]
As a matter of fact, the first line in the training corpus is 99% similar to this line because only one word is changed. Surprisingly, document_id 1 is nowhere in the top 5 list.
The dummy line should be selected from lee_background.cor and not from lee.cor.
The model matches text against the training corpus, not the test corpus.

Combining phrases from list of words Python3

I'm doing my best to grab information out of a lot of PDF files. I have them in a dictionary format where the key is a given date and the value is a list of occupations.
It looks like this when proper:
'12/29/2014': [['COUNSELING',
However, occasionally there are occupations with several words which cannot be reliably understood in single word-form, such as this:
'11/03/2014': [['DENTISTRY',
Notice that "osteopathic medicine & surgery" and "speech-language pathology" are the full text for two of these entries. This gets hairier when we also have examples of just "osteopathic medicine" or even "medicine."
So my question is this - How should I go about testing combinations of these words to see if they match more complex occupational titles? I can use the same order of the words, as I have maintained that from the source.
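One possible approach, sketched below under stated assumptions, is a greedy longest-match pass: at each position, try the longest run of consecutive words that appears in a list of known titles, and fall back to the single word otherwise. The function name `combine_phrases`, the word list, and the title list are all hypothetical; the real tokenization of the source data is not shown in the question.

```python
def combine_phrases(words, known_titles):
    """Greedily merge consecutive words into the longest known title."""
    titles = {tuple(title.split()) for title in known_titles}
    longest = max((len(title) for title in titles), default=1)
    result, i = [], 0
    while i < len(words):
        # Try the longest window first; a single word is always accepted.
        for size in range(min(longest, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + size])
            if size == 1 or candidate in titles:
                result.append(" ".join(candidate))
                i += size
                break
    return result

# Hypothetical inputs, kept in source order as the question describes.
words = ["DENTISTRY", "OSTEOPATHIC", "MEDICINE", "&", "SURGERY",
         "SPEECH-LANGUAGE", "PATHOLOGY", "MEDICINE"]
known_titles = ["OSTEOPATHIC MEDICINE & SURGERY", "OSTEOPATHIC MEDICINE",
                "SPEECH-LANGUAGE PATHOLOGY", "MEDICINE", "DENTISTRY"]
print(combine_phrases(words, known_titles))
```

Because longer windows are tried first, "OSTEOPATHIC MEDICINE & SURGERY" wins over "OSTEOPATHIC MEDICINE" when all four words are present, while a lone "MEDICINE" is still kept as its own entry. This relies on having a list of valid titles to match against.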

Tokenizer Training with StanfordNLP

My requirement is simple to state. I need the StanfordCoreNLP default models along with my custom trained model, based on custom entities. In the final run, I need to be able to isolate specific phrases from a given sentence (RegexNER will be used).
Here is what I have tried:
I wanted to use the StanfordCoreNLP CRF files, tagger files and NER model files, along with my custom trained NER models.
I tried to find out whether there is an official way of doing this, but didn't find anything. There is a property "ner.model" for the StanfordCoreNLP pipeline, but it skips the default models if used.
Next (might not be the smartest thing ever, sorry! Just a guy trying to make ends meet!), I extracted the models jar stanford-corenlp-models-3.7.0.jar and copied all of the following:
*.ser.gz (Parser Models)
*.tagger (POS Tagger)
*.crf.ser.gz (NER CRF Files)
and tried to put comma-separated values in the properties "parser.model", "pos.model" and "ner.model" respectively, as follows:
But I get the following exception:
Caused by: Error while loading a tagger model (probably missing model file)
Caused by: invalid stream header: EFBFBDEF
I thought I would be able to handle this with RegexNER, and I was successful to some extent. However, the entities it learns through RegexNER are not applied to later expressions. E.g. it will find the entity "CUSTOM_ENTITY" inside a text, but if I add a RegexNER rule like ( [ {ner:CUSTOM_ENTITY} ] /with/ [ {ner:CUSTOM_ENTITY} ] ) it never succeeds in finding the right phrase.
I really need help here! I don't want to train the complete model again; the Stanford folks have over a GB of model information which is useful to me. I just want to add custom entities too.
First of all, make sure your CLASSPATH has the proper jars in it.
Here is how you should include your custom trained NER model:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.model <csv-of-model-paths> -file example.txt
-ner.model should be set to a comma separated list of all models you want to use.
Here is an example of what you could put:
Note in my example that all of the standard models will be run, and then finally your custom model will be run. Make sure your custom model is in the CLASSPATH.
You also probably need to add this to your command: -ner.combinationMode HIGH_RECALL. By default the NER combination will only use the tags for a particular class from the first model. So if you have model1,model2,model3 only model1's LOCATION will be used. If you set things to HIGH_RECALL then model2 and model3's LOCATION tags will be used as well.
Another thing to keep in mind, model2 can't overwrite decisions by model1. It can only overwrite "O". So if model1 says that a particular token is a LOCATION, model2 can't say it's an ORGANIZATION or a PERSON or anything. So the order of the models in your list matters.
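The two rules above (later models can only fill in "O", and by default a label type is only taken from the first model that handles it) can be illustrated with a toy Python model. This is not CoreNLP's code: it works per token rather than per entity span, and the function name `combine_ner`, the per-model label sets, and the sample tags are all assumptions made for the sake of the example.

```python
def combine_ner(models, high_recall=False):
    """Merge per-token tags from several models, earlier models taking priority.

    models: list of (label_set, tags) pairs in priority order (assumed format).
    """
    first_labels, first_tags = models[0]
    merged = list(first_tags)
    claimed = set(first_labels)  # label types owned by earlier models
    for labels, tags in models[1:]:
        for i, tag in enumerate(tags):
            if tag == "O" or merged[i] != "O":
                continue  # nothing to add, or an earlier model already decided
            if high_recall or tag not in claimed:
                merged[i] = tag
        claimed |= set(labels)
    return merged

# Hypothetical tags for five tokens from two models.
model1 = ({"PERSON", "LOCATION"}, ["PERSON", "O", "O", "O", "O"])
model2 = ({"LOCATION", "ORGANIZATION"},
          ["LOCATION", "O", "ORGANIZATION", "O", "LOCATION"])
models = [model1, model2]

print(combine_ner(models))                    # default behaviour as described
print(combine_ner(models, high_recall=True))  # HIGH_RECALL behaviour as described
```

In this toy run, model2 never changes the first token (model1 already tagged it), its ORGANIZATION tag is accepted in both modes, and its LOCATION tag on the last token is only accepted with `high_recall=True`.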
If you want to write rules that use entities found by previous rules, you should look at my answer to this question:
TokensRegex rules to get correct output for Named Entities
Based on the context you gave:
Use this instead of comma-separated values, and try to keep all the jars in the same directory:
Now copy the above lines, make similar entries for the other models, and paste them into a file.
If you don't have the file, create it.
Then use the following command to start your server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -serverProperties

Can I use punctuation in Stanford CoreNLP Named Entities?

I'm trying to get Stanford Core NLP to recognise an identification code. The problem is the code has punctuation in it. e.g. 01.A01.01 which causes the input to be separated into three sentences.
The matching expression for this code would be [0-9][0-9][.][a-z,A-Z][0-9][0-9][.][0-9][0-9]. I've tried adding this into my regexner.txt file but it doesn't identify it (presumably because the tokens are across separate sentences?)
I've also tried to match it using a TokensRegex pattern similar to the following (also without any success).
/tell/ /me/ /about/ (?$refCode /[0-9][0-9]/ /./ /[a-z,A-Z][0-9][0-9]/ /./ /[0-9][0-9]/ )
Some example uses...
The user has resource 02.G36.63 reserved.
Is 21.J83.02 available?
Does anyone have any ideas or suggestions?
I took your sample input and replaced "\n" with " ", to create:
The user has resource 02.G36.63 reserved. Is 21.J83.02 available?
I created this rules file (sample-rules.txt):
02.G36.63 ID_CODE MISC 2
And I ran this command:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -regexner.mapping sample-rules.txt -ssplit.eolonly -tokenize.whitespace -file sample-sentence.txt -outputFormat text
I got this output:
Sentence #1 (9 tokens):
The user has resource 02.G36.63 reserved. Is 21.J83.02 available?
[Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT Lemma=the NamedEntityTag=O]
[Text=user CharacterOffsetBegin=4 CharacterOffsetEnd=8 PartOfSpeech=NN Lemma=user NamedEntityTag=O]
[Text=has CharacterOffsetBegin=9 CharacterOffsetEnd=12 PartOfSpeech=VBZ Lemma=have NamedEntityTag=O]
[Text=resource CharacterOffsetBegin=13 CharacterOffsetEnd=21 PartOfSpeech=NN Lemma=resource NamedEntityTag=O]
[Text=02.G36.63 CharacterOffsetBegin=22 CharacterOffsetEnd=31 PartOfSpeech=NN Lemma=02.g36.63 NamedEntityTag=ID_CODE]
[Text=reserved. CharacterOffsetBegin=32 CharacterOffsetEnd=41 PartOfSpeech=NN Lemma=reserved. NamedEntityTag=O]
[Text=Is CharacterOffsetBegin=43 CharacterOffsetEnd=45 PartOfSpeech=VBZ Lemma=be NamedEntityTag=O]
[Text=21.J83.02 CharacterOffsetBegin=46 CharacterOffsetEnd=55 PartOfSpeech=NN Lemma=21.j83.02 NamedEntityTag=O]
[Text=available? CharacterOffsetBegin=56 CharacterOffsetEnd=66 PartOfSpeech=NN Lemma=available? NamedEntityTag=O]
These options tell CoreNLP to tokenize only on whitespace, so it stopped breaking on the periods. They also tell it to split sentences only on newlines, so it is important to put the entire user request on one line in the input file. You won't get sentences, but you can get a token stream and identify your product codes.
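The same whitespace-token idea can be tried outside CoreNLP with a short Python sketch. It is a stand-in for illustration only, not a RegexNER rule: the function name `tag_id_codes` is hypothetical, and the pattern uses `[A-Za-z]` because the `[a-z,A-Z]` class in the question would also match a literal comma. Unlike the rules file above, which lists one literal code, this pattern matches any code of that shape.

```python
import re

# Two digits, dot, one letter plus two digits, dot, two digits (e.g. 02.G36.63).
ID_CODE = re.compile(r"\d{2}\.[A-Za-z]\d{2}\.\d{2}")

def tag_id_codes(text):
    """Split on whitespace and label each token as ID_CODE or O."""
    tagged = []
    for token in text.split():
        core = token.rstrip(".,?!")  # ignore trailing sentence punctuation
        label = "ID_CODE" if ID_CODE.fullmatch(core) else "O"
        tagged.append((token, label))
    return tagged

text = "The user has resource 02.G36.63 reserved. Is 21.J83.02 available?"
for token, label in tag_id_codes(text):
    print(token, label)
```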
Now if you really want the full power of Stanford CoreNLP and you don't want to have these codes split, you could take the ambitious route and alter the tokenizer PTBLexer.flex file to include all of your id codes.
That file is here in the repo:
You'll have to Google around to find instructions on compiling the PTBLexer.flex file. This site should have the info you need:
This would basically mean adding in your id codes and making a few slight edits, and then rebuilding PTBLexer. Then with your custom tokenizer Stanford CoreNLP would treat your product codes like complete tokens and you could have normal sentence splitting if you want to do something like analyze the dependency structure of your user requests.

Can't Input Tab Delimited file to Stanford Classifier

I'm having a problem inputting tab delimited files into the stanford classifier.
Although I was able to successfully walk through all the included stanford tutorials, including the newsgroup tutorial, when I try to input my own training and test data it doesn't load properly.
At first I thought the problem was that I was saving the data into a tab delimited file using an Excel spreadsheet and it was some kind of encoding issue.
But then I got exactly the same results when I did the following. First I literally typed the demo data below into gedit, making sure to use a tab between the politics/sports class and the ensuing text:
politics Obama today announced a new immigration policy.
sports The NBA all-star game was last weekend.
politics Both parties are eyeing the next midterm elections.
politics Congress votes tomorrow on electoral reforms.
sports The Lakers lost again last night, 102-100.
politics The Supreme Court will rule on gay marriage this spring.
sports The Red Sox report to spring training in two weeks.
sports Messi set a world record for goals in a calendar year in 2012.
politics The Senate will vote on a new budget proposal next week.
politics The President declared on Friday that he will veto any budget that doesn't include revenue increases.
I saved that as myproject/demo-train.txt and a similar file as myproject/demo-test.txt.
I then ran the following:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt
The good news: this actually ran without throwing any errors.
The bad news: since it doesn't extract any features, it can't actually estimate a real model and the probability defaults to 1/n for each item, where n is the number of classes.
So then I ran the same command but with two basic options specified:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
That yielded:
Exception in thread "main" java.lang.RuntimeException: Training dataset could not be processed
at edu.stanford.nlp.classify.ColumnDataClassifier.readDataset(
at edu.stanford.nlp.classify.ColumnDataClassifier.readTrainingExamples (
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(
at edu.stanford.nlp.classify.ColumnDataClassifier.main(
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatumFromLine(
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(
... 3 more
These are exactly the same results I get when I used the real data I saved from Excel.
What's more, I don't know how to make sense of the ArrayIndexOutOfBoundsException. When I used readline in Python to print out the raw strings for both the demo files I created and the tutorial files that worked, nothing about the formatting seemed different. So I don't know why this exception would be raised with one set of files but not the other.
Finally, one other quirk. At one point I thought maybe line breaks were the problem. So I deleted all line breaks from the demo files while preserving tab breaks and ran the same command:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
Surprisingly, this time no java exceptions are thrown. But again, it's worthless: it treats the entire file as one observation, and can't properly fit a model as a result.
I've spent 8 hours on this now and have exhausted everything I can think of. I'm new to Java but I don't think that should be an issue here -- according to Stanford's API documentation for ColumnDataClassifier, all that's required is a tab delimited file.
Any help would be MUCH appreciated.
One last note: I've run these same commands with the same files on both Windows and Ubuntu, and the results are the same in each.
Use a properties file. In the Stanford classifier example:
2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
The number 2 at the start of lines 3, 4 and 5 signifies the column in your tsv file. So in your case you would use
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
or if you want to run with command line arguments
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -1.useSplitWords -1.splitWordsRegexp "\s+"
I've faced the same error as you.
Pay attention to the tabs in the text you are classifying.
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
This means that at some point the classifier expects an array of 3 elements after it splits the line on tabs.
I ran a method that counts the number of tabs in each line; any line that does not have exactly two of them causes this error.
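A tab-counting check like the one described could look roughly like this in Python. The function name `find_bad_lines` and the sample lines are hypothetical; the expected tab count is a parameter because it depends on how many columns your file has (one tab for the two-column demo data in the question, two for a three-column file).

```python
def find_bad_lines(lines, expected_tabs):
    """Return (line_number, tab_count) for every line with an unexpected tab count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        tabs = line.count("\t")
        if tabs != expected_tabs:
            bad.append((number, tabs))
    return bad

# Hypothetical sample: the third line uses a space where a tab should be.
sample = [
    "politics\tObama today announced a new immigration policy.\n",
    "sports\tThe NBA all-star game was last weekend.\n",
    "sports The Lakers lost again last night, 102-100.\n",
]
print(find_bad_lines(sample, expected_tabs=1))
```

To check a real file, pass the open file object instead of the list, e.g. `find_bad_lines(open(path, encoding="utf-8"), expected_tabs=1)`, where `path` is your training file.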
