Stanford NLP: set RegexNERAnnotator to caseInsensitive

I am identifying qualifications in a large corpus, using NamedEntityTagAnnotation.
Problem:
My annotations are matched case-sensitively. I want them to be matched case-insensitively.
Hence
Bachelor's Degree DEGREE
should not need an additional entry of
Bachelor's degree DEGREE
I know this is possible: RegexNERAnnotator has an ignoreCase field. But I don't know how to access RegexNERAnnotator through the API.
My current code (which I cadged off the internet, and which works apart from the case issue) is as follows:
String prevNeToken = "O";
String currNeToken = "O";
boolean newToken = true;
for (CoreLabel token : sentence.get(TokensAnnotation.class))
{
    currNeToken = token.get(NamedEntityTagAnnotation.class);
    String word = token.get(TextAnnotation.class);
    if (currNeToken.equals("O"))
    {
        if (!prevNeToken.equals("O") && (sbuilder.length() > 0))
        {
            handleEntity(prevNeToken, sbuilder, tokens);
            newToken = true;
        }
        continue;
    }
    if (newToken)
    {
        prevNeToken = currNeToken;
        newToken = false;
        sbuilder.append(word);
        continue;
    }
    if (currNeToken.equals(prevNeToken))
    {
        sbuilder.append(" " + word);
    }
    else
    {
        handleEntity(prevNeToken, sbuilder, tokens);
        newToken = true;
    }
    prevNeToken = currNeToken;
}
Any assistance would be greatly appreciated.

The answer is in how you set up the pipeline.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, regexner, depparse, natlog, openie");
//props.put("regexner.mapping", namedEntityPropertiesPath);
pipeline = new StanfordCoreNLP(props);
pipeline.addAnnotator(new TokensRegexNERAnnotator(namedEntityPropertiesPath, true));
Do not use props.put("regexner.mapping", namedEntityPropertiesPath); use pipeline.addAnnotator instead.
The first argument to the constructor is the path to your NER data file. The second is a boolean that makes matching case-insensitive.
Note that this uses Stanford's NER lists as well as your own, and that TokensRegexNERAnnotator accepts a more complex format for the NER data file.
See http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/TokensRegexNERAnnotator.html
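To make concrete what the flag buys you, here is a minimal, library-free sketch. It does not use CoreNLP and is not how TokensRegexNERAnnotator is implemented; the class CaseInsensitiveMapping and its methods are made up for illustration. It assumes a simple phrase-to-label mapping like the two entries in the question, and compiles each phrase with Pattern.CASE_INSENSITIVE so that one entry covers every capitalisation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of CoreNLP).
class CaseInsensitiveMapping {
    private final Map<Pattern, String> patterns = new LinkedHashMap<>();

    // Register one "phrase -> label" entry; the phrase is matched literally.
    void add(String phrase, String label, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE : 0;
        patterns.put(Pattern.compile(Pattern.quote(phrase), flags), label);
    }

    // Return the label of the first entry that matches the whole text, or "O".
    String label(String text) {
        for (Map.Entry<Pattern, String> e : patterns.entrySet()) {
            if (e.getKey().matcher(text).matches()) {
                return e.getValue();
            }
        }
        return "O";
    }

    public static void main(String[] args) {
        CaseInsensitiveMapping mapping = new CaseInsensitiveMapping();
        mapping.add("Bachelor's Degree", "DEGREE", true);
        System.out.println(mapping.label("Bachelor's degree")); // DEGREE
        System.out.println(mapping.label("Master's degree"));   // O
    }
}
```

With ignoreCase set to false, the same lookup returns "O" for "Bachelor's degree", which is the behaviour the question is trying to get away from.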

Related

String comparison not working for sharepoint multiline text values

I am fetching data from a SharePoint list for a multi-line column.
I then split the data on spaces and compare each piece with another string, but even though both strings appear to hold the same value, the comparison returns false.
Please see the code below:
string[] strBodys = SPHttpUtility.ConvertSimpleHtmlToText(Convert.ToString(workflowProperties.ListItem[SCMSConstants.lstfldBody]), Convert.ToString(workflowProperties.ListItem[SCMSConstants.lstfldBody]).Length).Split(' ');
bool hasKwrdInBody = false;
foreach (SPItem oItem in oColl)
{//get all the keywords
string[] strkeyWrds = SPHttpUtility.ConvertSimpleHtmlToText(Convert.ToString(oItem[SCMSConstants.lstfldKWConfigKeywordsIntrName]), Convert.ToString(oItem[SCMSConstants.lstfldKWConfigKeywordsIntrName]).Length).Split(',');
//in body
foreach (string strKW in strkeyWrds)
{
string KWValue = strKW.Trim(' ').ToLower();
foreach (string strBdy in strBodys)
{
string BodyValue = strBdy.Trim(' ').ToLower();
//if (strKW.ToLower().Equals(strBdy.ToLower()))
if(KWValue == BodyValue) //here it always gives false result
{
hasKwrdInBody = true;
break;
}
}
if (hasKwrdInBody)
break;
}
if (!hasKwrdInSbjct && !hasKwrdInBody)
{
continue;
}
else
{
//set business unit to current groups rule
bsnsUnitLookupFld = new SPFieldLookupValue(Convert.ToString(oItem[SCMSConstants.lstfldBsnsUnit]));
asgndTo = new SPFieldUserValue(objWeb,Convert.ToString(oItem[SCMSConstants.lstfldKWConfigAssignedToIntrName])).User;
groupName = Convert.ToString(oItem[SCMSConstants.lstfldKWConfigAssignedToGroupIntrName]).Split('#').Last();
break;
}
}
Please note that I am reading multi-line text from a SharePoint list.
Any suggestions are welcome.

That also depends on the exact type of your multi-line field (e.g. plain text or rich text).
It may become clearer if you add some logging that writes out the values you are comparing.
For details on how to get the value of a multi-line text field, check "Accessing Multiple line of text programmatically",
and here for RichText.

I got it working by comparing and counting the characters in both strings. It turned out that some non-ASCII (Unicode) characters were embedded in the string. I first removed those characters with a regular expression and then compared the strings, and it worked like a charm.
Here is the code snippet; it might help someone.
string[] strBodys = SPHttpUtility.ConvertSimpleHtmlToText(Convert.ToString(workflowProperties.ListItem[SCMSConstants.lstfldBody]), Convert.ToString(workflowProperties.ListItem[SCMSConstants.lstfldBody]).Length).Split(' ');
bool hasKwrdInBody = false;
foreach (SPItem oItem in oColl)
{//get all the keywords
string[] strkeyWrds = SPHttpUtility.ConvertSimpleHtmlToText(Convert.ToString(oItem[SCMSConstants.lstfldKWConfigKeywordsIntrName]), Convert.ToString(oItem[SCMSConstants.lstfldKWConfigKeywordsIntrName]).Length).Split(',');
//in body
foreach (string strKW in strkeyWrds)
{
string KWValue = strKW.Trim(' ').ToLower();
// Regex comes from System.Text.RegularExpressions
KWValue = Regex.Replace(KWValue, @"[^\u0000-\u007F]", string.Empty); // strip non-ASCII characters
foreach (string strBdy in strBodys)
{
string BodyValue = strBdy.Trim(' ').ToLower();
BodyValue = Regex.Replace(BodyValue, @"\t|\n|\r", string.Empty); // new code: strip tabs and line breaks
BodyValue = Regex.Replace(BodyValue, @"[^\u0000-\u007F]", string.Empty); // new code: strip non-ASCII characters
//if (strKW.ToLower().Equals(strBdy.ToLower()))
if(KWValue == BodyValue) // matches now that both values are cleaned
{
hasKwrdInBody = true;
break;
}
}
if (hasKwrdInBody)
break;
}
if (!hasKwrdInSbjct && !hasKwrdInBody)
{
continue;
}
else
{
//set business unit to current groups rule
bsnsUnitLookupFld = new SPFieldLookupValue(Convert.ToString(oItem[SCMSConstants.lstfldBsnsUnit]));
asgndTo = new SPFieldUserValue(objWeb,Convert.ToString(oItem[SCMSConstants.lstfldKWConfigAssignedToIntrName])).User;
groupName = Convert.ToString(oItem[SCMSConstants.lstfldKWConfigAssignedToGroupIntrName]).Split('#').Last();
break;
}
}
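The fix boils down to normalising both strings before comparing them. Here is a minimal stand-alone sketch of that idea, written in Java rather than C#. The class KeywordCompare and its methods are made-up names, and the zero-width space in the example is only an assumption about the kind of invisible character that ends up in a multi-line field.

```java
import java.util.Locale;

// Hypothetical helper illustrating the clean-up; not a SharePoint API.
class KeywordCompare {
    // Drop tabs and line breaks, drop every non-ASCII character,
    // then trim and lower-case what is left.
    static String normalize(String s) {
        return s.replaceAll("[\\t\\n\\r]", "")
                .replaceAll("[^\\x00-\\x7F]", "")
                .trim()
                .toLowerCase(Locale.ROOT);
    }

    static boolean sameKeyword(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        String fromList = "Invoice\u200B"; // ends with an invisible zero-width space
        // Plain comparison fails because of the hidden character.
        System.out.println("invoice".equals(fromList.toLowerCase(Locale.ROOT))); // false
        // Comparison after normalising succeeds.
        System.out.println(sameKeyword("invoice", fromList)); // true
    }
}
```

Note that stripping everything outside ASCII also removes legitimate accented letters, so this is only suitable when the keywords themselves are plain ASCII.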

Flow for conditionals inside sequence diagram

I need to document the method setRepresentative in a UML sequence diagram. This is the method's code:
class ReptoolController extends PageController {
private function setRepresentative($request, $action, $case)
{
...
$repappConfig = new RepappConfig();
$repappConfig = $this->getDoctrine()->getRepository('AppBundle:RepappConfig')->findOneBy(array("app_id"=>$id));
$project_id = $repappConfig->getProjectId();
$company_id = $repappConfig->getCompanyId();
$project = $this->getDoctrine()->getRepository('AppBundle:Project')->find($project_id);
$brand = $this->getDoctrine()->getRepository('AppBundle:Brand')->findOneBy(array("project"=>$project_id));
$company = $this->getDoctrine()->getRepository('AppBundle:Company')->find($company_id);
$territory = new Territory();
if(is_numeric($territory_name))
{
$tempName = "ID";
}
else
{
$tempName = "territory";
}
if($territory = $this->getDoctrine()->getRepository('AppBundle:Territory')->findOneBy(array($tempName=>$territory_name)))
{
$territory_id = $territory->getID();
$response->territory_id = $territory_id;
if($brand)
{
$is_enabled = 1;
$position = 1;
$brand_id = $brand->getID();
$terr_brand_xrf = $this->getDoctrine()->getRepository('AppBundle:TerritoryBrandXref')->findOneBy(array("territory"=>$territory_id, "brand"=>$brand_id));
if(!$terr_brand_xrf)
{
$terr_brand_xref = new TerritoryBrandXref($territory,$brand,$position);
$terr_brand_xref->setIsEnabled($is_enabled);
$terr_brand_xref->updateTimestamps();
$em = $this->getDoctrine()->getEntityManager();
$em->persist($terr_brand_xref);
$em->flush();
}
}
}
else
{
$territory->setTerritory($territory_name);
$territory->setProject($project);
$em = $this->getDoctrine()->getEntityManager();
$em->persist($territory);
$em->flush();
$territory_id = $territory->getID();
$response->territory_id = $territory_id;
if($brand)
{
$is_enabled = 1;
$position = 1;
$brand_id = $brand->getID();
$response->brand_id= $brand_id;
$terr_brand_xref = new TerritoryBrandXref($territory,$brand,$position);
$terr_brand_xref->setIsEnabled($is_enabled);
$terr_brand_xref->updateTimestamps();
$em = $this->getDoctrine()->getEntityManager();
$em->persist($terr_brand_xref);
$em->flush();
}
}
$controller_response = new Response( json_encode($response) );
$controller_response->headers->set('Content-Type', 'application/json; charset=utf-8');
return $controller_response;
}
}
This is the diagram as I have it now:
How do I diagram the conditionals inside this piece of code:
if($territory = $this->getDoctrine()->getRepository('PDOneBundle:Territory')->findOneBy(array($tempName=>$territory_name)))
{
...
} else {
...
}
How do I show the calls to the methods inside it?

Actually, what you are asking does not make sense (see my comment here: UML Sequence Diagram help needed). SDs are not meant to repeat algorithms in graphical notation; code is much better for that purpose. The ability to show loops and if conditions inside SDs is meant to be used only for a high-level view of the system.
In your case you should concentrate on certain aspects of the runtime, like an important snapshot. Create an SD for the technical use case with a truly sequential message flow. If need be, create more than one SD to highlight different aspects. But do NOT try to squeeze the whole algorithm into a single SD.

Convert pinyin to Chinese Character

I want to take pinyin (English letters) as input and return Chinese characters that the user can choose from. I have seen this implemented in many places (OS keyboards and various websites), but I can't find a library that does it.
Alternatively, I could implement it myself if it is not too complex and does not require a large amount of data.

The simplest way to do this is to use javachinesepinyin, a lightweight Chinese pinyin input method.
You can find related code here.
private String[] pinyinToWord(String[] o) {
Result ret = null;
try {
ret = ptw.labelStateOfNodes(Arrays.asList(o));
} catch (Exception ex) {
System.out.println(ex.getMessage());
}
Map<Double, String> results = new HashMap<Double, String>();
if (null != ret && ret.states() != null) {
for (int pos = 0; pos < ret.states()[o.length - 1].length; pos++) {
StringBuilder sb = new StringBuilder();
int[] statePath = Viterbi.getStatePath(ret.states(), ret.psai(), o.length - 1, o.length, pos);
for (int state : statePath) {
Character name = ptw.getStateBy(state);
sb.append(name).append(" ");
}
results.put(ret.delta()[o.length - 1][pos], sb.toString());
}
List<Double> list = new ArrayList<Double>(results.keySet());
Collections.sort(list);
Collections.reverse(list);
return results.get(list.get(0)).trim().split(" ");
}
return null;
}
Intro Slides in English: http://docs.google.com/present/edit?id=0AbbbdNFzwcADZGR3Z3N0NG1fMTk4M2hraGZjNmRw&hl=en
Live Demo: http://951438.appspot.com/pinyin.jsp?txt=zhongwenpinyinshurufa
If you need advanced features, consider using the Rime Input Method Engine or sunpinyin.
FYI, there is a Python binding for sunpinyin.
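On the "doing it myself" part of the question: the core data structure is just a table from syllable to candidate characters, as in the rough sketch below. Everything in it is an assumption made for illustration. The class PinyinCandidates is a made-up name, and the four entries (中 and 重 for "zhong", 文 and 问 for "wen", written as Unicode escapes) are a hand-picked sample rather than real dictionary data. The hard part, which this sketch does not attempt, is ranking candidates across a whole syllable sequence; the snippet above appears to use Viterbi decoding for that.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy candidate lookup. The entries added in main are a tiny hand-made
// sample for illustration, not real dictionary data.
class PinyinCandidates {
    private final Map<String, List<String>> table = new HashMap<>();

    // Register candidates for one syllable, most likely first.
    void add(String syllable, String... characters) {
        List<String> list = table.computeIfAbsent(syllable, k -> new ArrayList<>());
        Collections.addAll(list, characters);
    }

    // Candidates for a syllable, or an empty list if it is unknown.
    List<String> candidates(String syllable) {
        List<String> found = table.get(syllable.toLowerCase(Locale.ROOT));
        return found == null ? Collections.<String>emptyList() : found;
    }

    public static void main(String[] args) {
        PinyinCandidates lookup = new PinyinCandidates();
        lookup.add("zhong", "\u4E2D", "\u91CD"); // two characters read "zhong"
        lookup.add("wen", "\u6587", "\u95EE");   // two characters read "wen"
        System.out.println(lookup.candidates("zhong"));
        System.out.println(lookup.candidates("wen"));
    }
}
```

A usable table needs thousands of entries plus frequency data, so in practice you would load it from one of the libraries mentioned above rather than type it in.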

Sentiment Analysis(SentiWordNet) - Judging the context of a sentence

I am trying to find whether a sentence is Positive or Negative in the following steps:
1.) Retrieving the Parts of speech(verbs, nouns, adjectives etc) from the sentence using the Stanford NLP parser.
2.) Using the SentiWordNet to find the Positive and Negative values related to each Part of Speech.
3.) Summing up the Positive and Negative values obtained to calculate a Net Positive and Net Negative value related to a sentence.
The problem is that SentiWordNet returns a list of positive/negative values, one pair per sense/context. Is it possible to pass a particular sentence along with the part of speech to the SentiWordNet parser, so that it judges the sense/context automatically and returns only one pair of positive and negative values?
Or is there an alternative solution to this problem?
Thanks.

SentiWordNet Demo Code
This may help you.
// Copyright 2013 Petter Törnberg
//
// This demo code has been kindly provided by Petter Törnberg <pettert@chalmers.se>
// for the SentiWordNet website.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
public class SentiWordNetDemoCode {
private Map<String, Double> dictionary;
public SentiWordNetDemoCode(String pathToSWN) throws IOException {
// This is our main dictionary representation
dictionary = new HashMap<String, Double>();
// From String to list of doubles.
HashMap<String, HashMap<Integer, Double>> tempDictionary = new HashMap<String, HashMap<Integer, Double>>();
BufferedReader csv = null;
try {
csv = new BufferedReader(new FileReader(pathToSWN));
int lineNumber = 0;
String line;
while ((line = csv.readLine()) != null) {
lineNumber++;
// If it's a comment, skip this line.
if (!line.trim().startsWith("#")) {
// We use tab separation
String[] data = line.split("\t");
String wordTypeMarker = data[0];
// Example line:
// POS ID PosS NegS SynsetTerm#sensenumber Desc
// a 00009618 0.5 0.25 spartan#4 austere#3 ascetical#2
// ascetic#2 practicing great self-denial;...etc
// Is it a valid line? Otherwise, throw an exception.
if (data.length != 6) {
throw new IllegalArgumentException(
"Incorrect tabulation format in file, line: "
+ lineNumber);
}
// Calculate synset score as score = PosS - NegS
Double synsetScore = Double.parseDouble(data[2])
- Double.parseDouble(data[3]);
// Get all Synset terms
String[] synTermsSplit = data[4].split(" ");
// Go through all terms of current synset.
for (String synTermSplit : synTermsSplit) {
// Get synterm and synterm rank
String[] synTermAndRank = synTermSplit.split("#");
String synTerm = synTermAndRank[0] + "#"
+ wordTypeMarker;
int synTermRank = Integer.parseInt(synTermAndRank[1]);
// What we get here is a map of the type:
// term -> {score of synset#1, score of synset#2...}
// Add map to term if it doesn't have one
if (!tempDictionary.containsKey(synTerm)) {
tempDictionary.put(synTerm,
new HashMap<Integer, Double>());
}
// Add synset link to synterm
tempDictionary.get(synTerm).put(synTermRank,
synsetScore);
}
}
}
// Go through all the terms.
for (Map.Entry<String, HashMap<Integer, Double>> entry : tempDictionary
.entrySet()) {
String word = entry.getKey();
Map<Integer, Double> synSetScoreMap = entry.getValue();
// Calculate weighted average. Weigh the synsets according to
// their rank.
// Score = 1/1*first + 1/2*second + 1/3*third ... etc.
// Sum = 1/1 + 1/2 + 1/3 ...
double score = 0.0;
double sum = 0.0;
for (Map.Entry<Integer, Double> setScore : synSetScoreMap
.entrySet()) {
score += setScore.getValue() / (double) setScore.getKey();
sum += 1.0 / (double) setScore.getKey();
}
score /= sum;
dictionary.put(word, score);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
if (csv != null) {
csv.close();
}
}
}
public double extract(String word, String pos) {
return dictionary.get(word + "#" + pos);
}
public static void main(String [] args) throws IOException {
if(args.length<1) {
System.err.println("Usage: java SentiWordNetDemoCode <pathToSentiWordNetFile>");
return;
}
String pathToSWN = args[0];
SentiWordNetDemoCode sentiwordnet = new SentiWordNetDemoCode(pathToSWN);
System.out.println("good#a "+sentiwordnet.extract("good", "a"));
System.out.println("bad#a "+sentiwordnet.extract("bad", "a"));
System.out.println("blue#a "+sentiwordnet.extract("blue", "a"));
System.out.println("blue#n "+sentiwordnet.extract("blue", "n"));
}
}
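The demo does not pick a sense from context. It collapses all senses of a word into one number by weighting each sense's score (PosS - NegS) with the reciprocal of its sense rank. Here is a small worked example of just that step; the class name RankWeightedScore and the three scores are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Stand-alone illustration of the rank-weighted average used in the demo.
class RankWeightedScore {
    // senseScores maps sense rank (1 = most common sense) to that sense's score.
    static double combine(Map<Integer, Double> senseScores) {
        double score = 0.0;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : senseScores.entrySet()) {
            score += e.getValue() / e.getKey(); // 1/rank * score
            sum += 1.0 / e.getKey();            // normalising weight
        }
        return score / sum; // NaN for an empty map
    }

    public static void main(String[] args) {
        Map<Integer, Double> senses = new TreeMap<>();
        senses.put(1, 0.5);   // hypothetical score of sense #1
        senses.put(2, -0.25); // hypothetical score of sense #2
        senses.put(3, 0.0);   // hypothetical score of sense #3
        // (0.5/1 - 0.25/2 + 0/3) / (1 + 1/2 + 1/3) = 0.375 / 1.8333...
        System.out.println(combine(senses)); // roughly 0.2045
    }
}
```

So the most common sense dominates the result, which is a reasonable default when you cannot disambiguate, but it is still a prior over senses and not a judgement about the sentence at hand.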

You can pass the POS to the SentiWordNet lookup.
Download the pattern Python module:
from pattern.en import wordnet
print(wordnet.synsets("kill", pos="VB")[0].weight)
wordnet.synsets returns a list of synsets,
and from that we select the first item.
The output will be a tuple of (polarity, subjectivity).
Hope this helps...

Multi-term named entities in Stanford Named Entity Recognizer

I'm using the Stanford Named Entity Recognizer http://nlp.stanford.edu/software/CRF-NER.shtml and it's working fine. This is the code:
List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
    for (CoreLabel word : sentence) {
        if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
            namedEntities.add(word.word().trim());
        }
    }
}
However the problem I'm finding is identifying names and surnames. If the recognizer encounters "Joe Smith", it is returning "Joe" and "Smith" separately. I'd really like it to return "Joe Smith" as one term.
Could this be achieved through the recognizer maybe through a configuration? I didn't find anything in the javadoc till now.
Thanks!

This is because your inner for loop is iterating over individual tokens (words) and adding them separately. You need to change things to add whole names at once.
One way is to replace the inner for loop with a regular for loop with a while loop inside it which takes adjacent non-O things of the same class and adds them as a single entity.*
Another way would be to use the CRFClassifier method call:
List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)
which will give you whole entities, which you can extract the String form of by using substring on the original input.
*The models that we distribute use a simple raw IO label scheme, where things are labeled PERSON or LOCATION, and the appropriate thing to do is simply to coalesce adjacent tokens with the same label. Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicate where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012).
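A library-free sketch of that coalescing step is below. It works on two parallel lists (words and their labels) instead of CoreLabel objects so that it runs on its own; the class EntityCoalescer and the sample labels are assumptions made for illustration, not output from the actual classifier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Coalesces adjacent tokens that share a non-"O" label (IO scheme).
class EntityCoalescer {
    // Returns one {label, text} pair per entity, in order of appearance.
    static List<String[]> coalesce(List<String> words, List<String> labels) {
        List<String[]> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String prev = "O";
        for (int i = 0; i < words.size(); i++) {
            String label = labels.get(i);
            if (!label.equals(prev)) {
                // The label changed: close the entity that just ended, if any.
                if (!prev.equals("O")) {
                    entities.add(new String[] {prev, current.toString()});
                }
                current.setLength(0);
            }
            if (!label.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
            prev = label;
        }
        // Close an entity that runs up to the last token.
        if (!prev.equals("O")) {
            entities.add(new String[] {prev, current.toString()});
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Joe", "Smith", "lives", "in", "New", "York");
        List<String> labels = Arrays.asList("PERSON", "PERSON", "O", "O", "LOCATION", "LOCATION");
        for (String[] e : coalesce(words, labels)) {
            System.out.println(e[0] + ": " + e[1]);
        }
    }
}
```

With IO labels, two different entities of the same type that sit directly next to each other cannot be told apart, so they come out as one entity.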

The drawback of the classifyToCharacterOffsets method is that (AFAIK) you can't access the label of the entities.
As proposed by Christopher, here is an example of a loop which assembles "adjacent non-O things". This example also counts the number of occurrences.
public HashMap<String, HashMap<String, Integer>> extractEntities(String text) {
    HashMap<String, HashMap<String, Integer>> entities =
            new HashMap<String, HashMap<String, Integer>>();
    for (List<CoreLabel> lcl : classifier.classify(text)) {
        String prevAnswer = "O"; // label of the previous token
        String value = null;     // text of the entity being assembled
        for (CoreLabel cl : lcl) {
            String answer = cl.getString(CoreAnnotations.AnswerAnnotation.class);
            String word = cl.getString(CoreAnnotations.ValueAnnotation.class);
            if (!answer.equals("O") && answer.equals(prevAnswer)) {
                // Same label as the previous token: extend the current entity.
                value = value + " " + word;
            } else {
                // Label changed: store the entity that just ended, if any.
                if (!prevAnswer.equals("O"))
                    count(entities, prevAnswer, value);
                value = word;
            }
            prevAnswer = answer;
        }
        // Store an entity that runs up to the end of the sentence.
        if (!prevAnswer.equals("O"))
            count(entities, prevAnswer, value);
    }
    return entities;
}
// Increments the occurrence count of value under the given label.
private void count(HashMap<String, HashMap<String, Integer>> entities,
        String label, String value) {
    if (!entities.containsKey(label))
        entities.put(label, new HashMap<String, Integer>());
    Integer seen = entities.get(label).get(value);
    entities.get(label).put(value, seen == null ? 1 : seen + 1);
}

I had the same problem, so I looked it up, too. The method proposed by Christopher Manning is efficient, but the delicate point is deciding which kind of separator is appropriate. One could say only a space should be allowed, e.g. "John Zorn" >> one entity. However, I may find the form "J.Zorn", so I should also allow certain punctuation marks. But what about "Jack, James and Joe"? I might get 2 entities instead of 3 ("Jack James" and "Joe").
By digging a bit into the Stanford NER classes, I actually found a proper implementation of this idea. They use it to export entities in the form of single String objects. For instance, in the method PlainTextDocumentReaderAndWriter.printAnswersInlineXML, we have:
private void printAnswersInlineXML(List<IN> doc, PrintWriter out) {
final String background = flags.backgroundSymbol;
String prevTag = background;
for (Iterator<IN> wordIter = doc.iterator(); wordIter.hasNext();) {
IN wi = wordIter.next();
String tag = StringUtils.getNotNullString(wi.get(AnswerAnnotation.class));
String before = StringUtils.getNotNullString(wi.get(BeforeAnnotation.class));
String current = StringUtils.getNotNullString(wi.get(CoreAnnotations.OriginalTextAnnotation.class));
if (!tag.equals(prevTag)) {
if (!prevTag.equals(background) && !tag.equals(background)) {
out.print("</");
out.print(prevTag);
out.print('>');
out.print(before);
out.print('<');
out.print(tag);
out.print('>');
} else if (!prevTag.equals(background)) {
out.print("</");
out.print(prevTag);
out.print('>');
out.print(before);
} else if (!tag.equals(background)) {
out.print(before);
out.print('<');
out.print(tag);
out.print('>');
}
} else {
out.print(before);
}
out.print(current);
String afterWS = StringUtils.getNotNullString(wi.get(AfterAnnotation.class));
if (!tag.equals(background) && !wordIter.hasNext()) {
out.print("</");
out.print(tag);
out.print('>');
prevTag = background;
} else {
prevTag = tag;
}
out.print(afterWS);
}
}
They iterate over each word, checking whether it has the same class (answer) as the previous one, as explained before. For this, they take advantage of the fact that expressions not considered to be entities are flagged with the so-called backgroundSymbol (class "O"). They also use the BeforeAnnotation property, which represents the string separating the current word from the previous one. This last point solves the problem I initially raised, regarding the choice of an appropriate separator.
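To show the separator idea in isolation, here is a small sketch that joins the tokens of an entity with whatever text originally stood between them, instead of always inserting a space. It does not use the Stanford classes: SeparatorAwareJoin and its Tok type are hypothetical stand-ins for a token carrying its text, its label and its "before" string, and the tokenisation of "J.Zorn" into "J." and "Zorn" is assumed for the sake of the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Joins entity tokens using the original separator stored with each token.
class SeparatorAwareJoin {
    // Minimal stand-in for a token: text, label, and the text that preceded it.
    static final class Tok {
        final String text;
        final String label;
        final String before;

        Tok(String text, String label, String before) {
            this.text = text;
            this.label = label;
            this.before = before;
        }
    }

    static List<String> entities(List<Tok> tokens, String background) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String prev = background;
        for (Tok t : tokens) {
            // A label change ends the entity collected so far.
            if (!t.label.equals(prev) && current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
            if (!t.label.equals(background)) {
                if (current.length() > 0) {
                    current.append(t.before); // original separator, possibly empty
                }
                current.append(t.text);
            }
            prev = t.label;
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed tokenisation of the text "J.Zorn met John Zorn".
        List<Tok> tokens = Arrays.asList(
                new Tok("J.", "PERSON", ""),
                new Tok("Zorn", "PERSON", ""),
                new Tok("met", "O", " "),
                new Tok("John", "PERSON", " "),
                new Tok("Zorn", "PERSON", " "));
        System.out.println(entities(tokens, "O")); // [J.Zorn, John Zorn]
    }
}
```

The entity text then matches the source exactly, so no decision about which separators to allow is needed.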

Code for the above:
List<Triple<String, Integer, Integer>> result = classifier.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> triple : result)
{
    // first = entity label, second and third = start and end character offsets
    System.out.println(triple.first + " : " + text.substring(triple.second, triple.third));
}

List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
    String s = "";
    String prevLabel = null;
    for (CoreLabel word : sentence) {
        String label = word.get(CoreAnnotations.AnswerAnnotation.class);
        if (prevLabel == null || prevLabel.equals(label)) {
            // Same label as before: keep extending the current group.
            s = s + " " + word.word();
        }
        else {
            // Label changed: print the finished group unless it was background ("O").
            if (!prevLabel.equals("O"))
                System.out.println(s.trim() + '/' + prevLabel + ' ');
            s = " " + word.word();
        }
        prevLabel = label;
    }
    // Print a group that runs up to the end of the sentence.
    if (prevLabel != null && !prevLabel.equals("O"))
        System.out.println(s.trim() + '/' + prevLabel + ' ');
}
I wrote this small piece of logic and it works fine. It groups adjacent words that have the same label.

Make use of the classifiers already provided to you. I believe this is what you are looking for:
private static String combineNERSequence(String text) {
    String serializedClassifier = "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz";
    AbstractSequenceClassifier<CoreLabel> classifier = null;
    try {
        classifier = CRFClassifier.getClassifier(serializedClassifier);
    } catch (ClassCastException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Classify once and reuse the result for printing and returning.
    String tagged = classifier.classifyWithInlineXML(text);
    System.out.println(tagged);
    // FOR TSV FORMAT //
    //System.out.print(classifier.classifyToString(text, "tsv", false));
    return tagged;
}

Here is my full code. I use Stanford CoreNLP and wrote an algorithm to concatenate multi-term names.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import org.apache.log4j.Logger;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
/**
* Created by Chanuka on 8/28/14 AD.
*/
public class FindNameEntityTypeExecutor {
private static Logger logger = Logger.getLogger(FindNameEntityTypeExecutor.class);
private StanfordCoreNLP pipeline;
public FindNameEntityTypeExecutor() {
logger.info("Initializing Annotator pipeline ...");
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
pipeline = new StanfordCoreNLP(props);
logger.info("Annotator pipeline initialized");
}
List<String> findNameEntityType(String text, String entity) {
logger.info("Finding entity type matches in the " + text + " for entity type, " + entity);
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
List<String> matches = new ArrayList<String>();
for (CoreMap sentence : sentences) {
int previousCount = 0;
int count = 0;
// traversing the words in the current sentence
// a CoreLabel is a CoreMap with additional token-specific methods
for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
String word = token.get(CoreAnnotations.TextAnnotation.class);
int previousWordIndex;
if (entity.equals(token.get(CoreAnnotations.NamedEntityTagAnnotation.class))) {
count++;
if (previousCount != 0 && (previousCount + 1) == count) {
previousWordIndex = matches.size() - 1;
String previousWord = matches.get(previousWordIndex);
matches.remove(previousWordIndex);
previousWord = previousWord.concat(" " + word);
matches.add(previousWordIndex, previousWord);
} else {
matches.add(word);
}
previousCount = count;
}
else
{
count=0;
previousCount=0;
}
}
}
return matches;
}
}

Another approach to dealing with multi-word entities.
This code combines consecutive tokens into one entity when they have the same annotation.
Restriction:
If the same token has two different annotations, the last one will be saved.
private Document getEntities(String fullText) {
    Document entitiesList = new Document();
    NERClassifierCombiner nerCombClassifier = loadNERClassifiers();
    if (nerCombClassifier != null) {
        List<List<CoreLabel>> results = nerCombClassifier.classify(fullText);
        for (List<CoreLabel> coreLabels : results) {
            String prevLabel = null;
            String prevToken = null;
            for (CoreLabel coreLabel : coreLabels) {
                String word = coreLabel.word();
                String annotation = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);
                if (!"O".equals(annotation)) {
                    if (prevLabel != null && prevLabel.equals(annotation)) {
                        // Same label as the previous token: extend the entity.
                        prevToken += " " + word;
                    } else {
                        // A different entity starts: save the previous one first.
                        if (prevLabel != null) {
                            entitiesList.put(prevToken, prevLabel);
                        }
                        prevLabel = annotation;
                        prevToken = word;
                    }
                } else {
                    if (prevLabel != null) {
                        entitiesList.put(prevToken, prevLabel);
                        prevLabel = null;
                    }
                }
            }
            // Save an entity that runs up to the end of the sentence.
            if (prevLabel != null) {
                entitiesList.put(prevToken, prevLabel);
            }
        }
    }
    return entitiesList;
}
Imports:
Document: org.bson.Document;
NERClassifierCombiner: edu.stanford.nlp.ie.NERClassifierCombiner;
