Improving accuracy of speech recognition using Vosk (Kaldi) running on Android - speech-to-text

I am developing an application to collect data in the field on Android devices using speech recognition. There are five "target words", as well as several numbers (zero, one, ten, one-hundred, etc) that are recognized.
I have improved the accuracy for the target words by adding homonyms (homophones) as well as vernacular synonyms. The target words are Chinook, sockeye, coho, pink, and chum. This is the relevant code:
public void parseWords() {
List<String> szlNumbers = Arrays.asList(new String[]{"ONE", "TEN", "ONE HUNDRED", "ONE THOUSAND", "TEN THOUSAND"});
//species with phonemes and vernacular names
List<String> szlChinook = Arrays.asList("CHINOOK", "CHINOOK SALMON", "KING", "KINGS", "KING SALMON", "KING SALMAN");
List<String> szlSockeye = Arrays.asList("SOCKEYE", "SOCCER", "SOCKEYE SALMON", "SOCK ICE", "SOCCER ICE", "SOCK I SAID", "SOCCER IS", "OKAY SALMON", "RED SALMON", "READ SALMON", "RED", "REDS");
List<String> szlCoho = Arrays.asList("COHO", "COHO SALMON", "COVER SALMON", "SILVER SALMON", "SILVER", "SILVERS", "CO", "KOBO", "GO HOME", "COMO", "COVER", "GO");
List<String> szlPink = Arrays.asList("PINK", "A PINK", "PINKS", "PINK SALMON", "HANK SALMON", "EXAMINE", "HUMPY", "HOBBY", "HUMPIES", "HUM BE", "HUM P", "BE", "HUMPTY", "HOBBIES", "HUMVEE", "THE HUMVEES", "POMPEY");
List<String> szlChum = Arrays.asList("CHUM", "JOHN", "JUMP", "SHARMA", "CHARM", "COME", "CHARM SALMON", "COME SALMON", "CHUM SALMON", "JUMP SALMON", "TRUMP SALMON", "KETA SALMON", "KETA", "DOG", "DOGS", "DOG SALMON", "GATOR", "GATORS", "CALICO", "A CALICO");
//Collections.sort(szlChinook); //what is this?
if (szVoskOutput == null) {
//null string: check this first, before calling any method on it
return;
}
szVoskOutput = szVoskOutput.toUpperCase();
if (szVoskOutput.isEmpty()) {
//blank string: do nothing
return;
}
//pink
if (szlPink.contains(szVoskOutput)) {
szSpecies = "Pink";
populateSpecies();
return;
}
//chum
if (szlChum.contains(szVoskOutput)) {
szSpecies = "Chum";
populateSpecies();
return;
}
//sockeye
if (szlSockeye.contains(szVoskOutput)) {
szSpecies = "Sockeye";
populateSpecies();
return;
}
//coho
if (szlCoho.contains(szVoskOutput)) {
szSpecies = "Coho";
populateSpecies();
return;
}
//Chinook
if (szlChinook.contains(szVoskOutput)) {
szSpecies = "Chinook";
populateSpecies();
return;
}
if(szlNumbers.contains(szVoskOutput)) {//then this is a number, put in count txt box
tvCount.setText(szVoskOutput);
return;
}else{
Toast.makeText(this, "Please repeat clearly. Captured string is:" + szVoskOutput, Toast.LENGTH_SHORT).show();
}
}//end parseWords()
I have a streamlined version of the application with source code on GitHub: https://github.com/portsample/salmonTalkerLite
as well as the latest full version on Google Play: https://play.google.com/store/apps/details?id=net.blepsias.salmontalker
Using the target word and homonyms, I can get hits in four to five seconds. I would like to make this faster. What can I do to further tune for speed?
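The chain of `contains()` checks above can be collapsed into a single lookup table. The following is only a minimal sketch with abbreviated word lists; the class name `SpeciesLookup` and the method `speciesFor` are hypothetical and not part of the app. It simplifies the matching code but is not expected to change recognition latency, which is dominated by the recognizer itself.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: registers every accepted phrase once, so each
// recognizer result needs one hash lookup instead of five list scans.
final class SpeciesLookup {
    private static final Map<String, String> PHRASE_TO_SPECIES = new HashMap<>();

    private static void register(String species, List<String> phrases) {
        for (String phrase : phrases) {
            PHRASE_TO_SPECIES.put(phrase, species);
        }
    }

    static {
        // Abbreviated lists; the full lists from parseWords() would go here.
        register("Chinook", Arrays.asList("CHINOOK", "KING", "KING SALMON"));
        register("Sockeye", Arrays.asList("SOCKEYE", "SOCK ICE", "RED SALMON"));
        register("Coho", Arrays.asList("COHO", "SILVER", "SILVER SALMON"));
        register("Pink", Arrays.asList("PINK", "HUMPY", "HUMPIES"));
        register("Chum", Arrays.asList("CHUM", "KETA", "DOG SALMON"));
    }

    /** Returns the species for a raw recognizer string, or null if unknown. */
    static String speciesFor(String voskOutput) {
        if (voskOutput == null) {
            return null;
        }
        String key = voskOutput.trim().toUpperCase(Locale.ROOT);
        return key.isEmpty() ? null : PHRASE_TO_SPECIES.get(key);
    }
}
```

In `parseWords()`, a non-null result from `SpeciesLookup.speciesFor(szVoskOutput)` could be assigned to `szSpecies` before calling `populateSpecies()`.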

This helped out significantly. Recognition time is now consistently about 1.5 seconds.
private void recognizeMicrophone() {
if (speechService != null) {
setUiState(iSTATE_DONE);
speechService.stop();
speechService = null;
} else {
setUiState(iSTATE_MIC);
try {
// The grammar is a JSON array of phrases; every inner quote must be escaped.
Recognizer rec = new Recognizer(model, 16000.f, "[\"sockeye pink coho chum chinook atlantic salmon\", \"[unk]\"]");
speechService = new SpeechService(rec, 16000.0f);
speechService.startListening(this);
} catch (IOException e) {
setErrorState(e.getMessage());
}
}
}
This clears out the extraneous upstream Vosk output, leaving only the specified target words. It eliminates the need for the elaborate homonym-sorting conditionals shown in the original post. Thanks to Nickolay Shmyrev for this.
I am still looking for other methods to speed recognition up, or otherwise improve this process.
Updates and improvements will be reflected in source code on GitHub: https://github.com/portsample/salmonTalkerLite
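Because the grammar argument is a JSON array written inside a Java string literal, it is easy to get the quoting wrong. The sketch below builds that string from a list of phrases; `GrammarBuilder` and `toGrammarJson` are hypothetical names, and only plain string handling is used so that no JSON library is assumed.

```java
import java.util.List;

// Hypothetical helper that builds the JSON array string passed as the
// third argument of the Recognizer constructor shown above.
final class GrammarBuilder {
    static String toGrammarJson(List<String> phrases) {
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < phrases.size(); i++) {
            if (i > 0) {
                json.append(", ");
            }
            // Escape backslashes and double quotes so the result stays valid JSON.
            String escaped = phrases.get(i)
                    .replace("\\", "\\\\")
                    .replace("\"", "\\\"");
            json.append('"').append(escaped).append('"');
        }
        return json.append(']').toString();
    }
}
```

The result could then be passed as `new Recognizer(model, 16000.f, GrammarBuilder.toGrammarJson(phrases))`, with `"[unk]"` included as the last phrase as in the code above.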

Related

Problem with getEmail() Google sheet, not working for other persons?

I have a problem with getting email.
I have checked a lot of threads but I have not found the solution.
I am the leader of a group and I use Google Sheets for a checklist.
I would like to make this as simple as possible for everyone, so that there is no reason not to fill in the checklists.
Basically, every time a checkbox is ticked, the date and the user's email should be filled in automatically. But the email only works for me. I have tried with a friend's account and I cannot get it working.
Is this a security policy? Can I change it somehow?
I also updated appsscript.json:
{
"oauthScopes":
[
"https://www.googleapis.com/auth/spreadsheets.readonly",
"https://www.googleapis.com/auth/userinfo.email"
],
"timeZone": "Europe/Paris",
"dependencies": {
},
"exceptionLogging": "STACKDRIVER",
"runtimeVersion": "V8"
}
The script:
function onEdit(e) {
if(e.value != "TRUE" ) return;
e.source.getActiveSheet().getRange(e.range.rowStart,e.range.columnStart+2).setValue(new Date());
var text = e.user.getEmail();
if (text == "xxx1@gmail.com")
{
text = "xxx1";
}
else if (text == "Pxxx2@gmail.com")
{
text = "Pxxx2";
}
else if (text == "Kxxx2@gmail.com")
{
text = "Kxxx2";
}
else if (text == "kxxx1@gmail.com")
{
text = "Kxxx1";
}
else if (text == "jxxx1@gmail.com")
{
text = "Jxxx1";
}
else if (text == "jxxx2@gmail.com")
{
text = "Jxxx2";
}
else if (text == "")
{
text = "ID not found";
}
else
{
text = e.user.getEmail();
}
e.source.getActiveSheet().getRange(e.range.rowStart,e.range.columnStart+4).setValue(text);
}

Non-repeating randomly generated questions

I'm trying to generate random questions into a quiz. Currently everything is fine but the questions are repeating, how would you keep them from repeating? I've read several articles but I just don't quite understand how to implement the code.
public class CplusQuiz extends AppCompatActivity {
Button answer1, answer2, answer3, answer4;
TextView score, question;
private Questions mQuestions = new Questions();
private String mAnswer;
private int mScore = 0;
private int mQuestionLength = mQuestions.mQuestions.length;
Random r;
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_cplus_quiz);
r = new Random();
answer1 = (Button) findViewById(R.id.answer1);
answer2 = (Button) findViewById(R.id.answer2);
answer3 = (Button) findViewById(R.id.answer3);
answer4 = (Button) findViewById(R.id.answer4);
score = (TextView) findViewById(R.id.score);
question = (TextView) findViewById(R.id.question);
score.setText("Nerd Level: " + mScore);
updateQuestion(r.nextInt(mQuestionLength));
answer1.setOnClickListener(new View.OnClickListener() {
@Override
public void onClick(View v) {
// Compare text content with equals(); == only compares object references.
if(answer1.getText().toString().equals(mAnswer)) {
mScore++;
score.setText("Score: "+ mScore);
updateQuestion(r.nextInt(mQuestionLength));
}
else {
gameOver();
}
}
});
answer2.setOnClickListener(new View.OnClickListener() {
@Override
public void onClick(View v) {
if(answer2.getText().toString().equals(mAnswer)) {
mScore++;
score.setText("Score: "+ mScore);
updateQuestion(r.nextInt(mQuestionLength));
}
else {
gameOver();
}
}
});
answer3.setOnClickListener(new View.OnClickListener() {
@Override
public void onClick(View v) {
if(answer3.getText().toString().equals(mAnswer)) {
mScore++;
score.setText("Score: "+ mScore);
updateQuestion(r.nextInt(mQuestionLength));
}
else {
gameOver();
}
}
});
answer4.setOnClickListener(new View.OnClickListener() {
@Override
public void onClick(View v) {
if(answer4.getText().toString().equals(mAnswer)) {
mScore++;
score.setText("Score: "+ mScore);
updateQuestion(r.nextInt(mQuestionLength));
}
else {
gameOver();
}
}
});
}
private void updateQuestion(int num) {
question.setText(mQuestions.getQuestion(num));
answer1.setText(mQuestions.getChoice1(num));
answer2.setText(mQuestions.getChoice2(num));
answer3.setText(mQuestions.getChoice3(num));
answer4.setText(mQuestions.getChoice4(num));
mAnswer = mQuestions.getCorrectAnswer(num);
}
private void gameOver() {
AlertDialog.Builder alertDialogBuilder = new AlertDialog.Builder(CplusQuiz.this);
alertDialogBuilder
.setMessage("Epic Fail... Your nerd level is " + mScore + " ")
.setCancelable(false)
.setPositiveButton("Start Over",
new DialogInterface.OnClickListener() {
@Override
public void onClick(DialogInterface dialog, int i) {
startActivity(new Intent(getApplicationContext(), CplusQuiz.class));
finish();
}
})
.setNegativeButton("EXIT TO MAIN",
new DialogInterface.OnClickListener() {
@Override
public void onClick(DialogInterface dialog, int i) {
startActivity(new Intent(getApplicationContext(), MainActivity.class));
finish();
}
});
AlertDialog alertDialog = alertDialogBuilder.create();
alertDialog.show();
}
}
Questions.java file
package com.example.max.quiz;
/**
* Created by max on 4/24/2017.
*/
public class Questions {
public String mQuestions[] = {
"WHO IS THE FASTEST OF THESE VIDEO GAME CHARACTERS?",
"IN THE GAME HALO, WHAT IS THE NAME OF MASTER CHIEF'S AI SIDEKICK?",
"WHICH BAD GUY WAS INTRODUCED IN SUPER MARIO BROTHERS 2?",
"WHAT VIDEO GAME CONSOLE HAS THE HIGHEST NUMBER OF VIDEO GAME CONSOLE SALES OF ALL TIME?",
"WHICH OF THESE DO YOU NOT DO IN WE LOVE KATAMARI, THE SEQUEL TO KATAMARI DAMACY?",
"WHICH OF THESE BANDS IS NOT FEATURED IN GUITAR HERO III: LEGENDS OF ROCK?",
"HOW MANY UNLOCKABLE CHARACTERS CAN BE FOUND IN SUPER SMASH BROTHERS?",
"WHAT WAS NINTENDOS FIRST TRY AT AN ARCADE GAME?",
"WHICH DOES NOT HAVE WIFI?",
"WHAT VIDEO GAME CONSOLE HAS THE HIGHEST NUMBER OF VIDEO GAME CONSOLE SALES OF ALL TIME?",
};
private String mChoices[][] = {
{"Mario", "Sonic", "Donkey Kong", "The Paperboy"},
{"Cortana", "Arbiter", "343 Guilty Spark", "HAL"},
{"Koopa troopa", "Lakitu", "Shy Guy", "Goomba"},
{"Xbox 360", "Nintendo 64", "Wii", "PlayStation 2"},
{"roll around under water", "roll around while on fire", "roll around on the moon", "roll around a sumo wrestler"},
{"Metallica", "Weezer", "Iron Maiden", "Lynyrd Skynyrd"},
{"1", "2", "3", "4"},
{"Super Mario Brothers", "Donkey Kong Jr.", "Donkey Kong", "Final Fantasy"},
{"Mario Kart DS", "Diddy Kong Racing DS", "Tony Hawk's American Sk8land", "Super Mario 64 DS"},
{"Xbox 360", "Nintendo 64", "Wii", "PlayStation 2"},
};
private String mCorrectAnswers[] = {"Sonic", "Cortana", "Shy Guy", "PlayStation 2","roll around on the moon", "Iron Maiden", "4", "Donkey Kong", "Super Mario 64 DS", "PlayStation 2" };
public String getQuestion(int a) {
String question = mQuestions[a];
return question;
}
public String getChoice1(int a) {
String choice = mChoices[a][0];
return choice;
}
public String getChoice2(int a) {
String choice = mChoices[a][1];
return choice;
}
public String getChoice3(int a) {
String choice = mChoices[a][2];
return choice;
}
public String getChoice4(int a) {
String choice = mChoices[a][3];
return choice;
}
public String getCorrectAnswer (int a) {
String answer = mCorrectAnswers [a];
return answer;
}
}
Here's a solution in JavaScript; you can follow the logic and then hopefully apply the concepts to your Java project. There's a running demo in this Plunker: https://plnkr.co/edit/KcHh63ou25LZDaZ79iwI?p=preview
Basically, what we're doing is creating an initial array of all the possible questions, and an empty array of questions we're going to include in the quiz. We decide how many questions to include, in this example 5, and then loop that number of times.
On each iteration, we get the current length of the array holding all the possible questions and pick a random index from 0 up to (but not including) the number of questions left in the array. Then we push the question at that index onto our array of quiz questions, and splice that question out of our array of possible questions so it's not there to get picked a second time.
Then we iterate again. Now the source question array is one shorter, so we get its new length, pick a new random number, grab that question, add it to our quiz array, and splice it out of our source questions array.
Lather, rinse, repeat.
questionsSource = [{
"question": "What is your favourite colour?",
"answer": "blue",
}, {
"question": "What is the average airspeed of a coconut laden swallow?",
"answer": "African or European Swallow?",
}, {
"question": "Question 3?",
"answer": "Answer 3",
}, {
"question": "Question 4?",
"answer": "Answer 4",
}, {
"question": "Question 5?",
"answer": "Answer 5",
}, {
"question": "Question 6?",
"answer": "Answer 6",
}, {
"question": "Question 7?",
"answer": "Answer 7",
}, {
"question": "Question 8?",
"answer": "Answer 8",
}]
questionsLeft = questionsSource.slice(); // work on a copy so splice() does not empty questionsSource
questionsPicked = [];
questionsInQuiz = 5; // how many questions you want to look up
for (i = 0; i < questionsInQuiz; i++) {
questionsLeftInArray = this.questionsLeft.length;
randNumber = Math.floor(Math.random() * questionsLeftInArray);
// pick a random number between 0 and the number of questions left in the array.
this.questionsPicked.push(questionsLeft[randNumber]);
// add the randomly selected question to the questionsPicked array.
this.questionsLeft.splice(randNumber, 1);
// remove that question from the remaining questions so it can't be picked again.
}
console.log("QuestionsSource", questionsSource);
console.log("questionsPicked", questionsPicked);
output = JSON.stringify(questionsPicked, null, 4)
document.write(output);
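The same idea can be sketched in Java. Instead of removing a random element on every draw, this version shuffles the question indices once and then hands them out one at a time, which gives the same no-repeat guarantee. `QuestionOrder` is a hypothetical helper name, not something from the question's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: hands out each question index exactly once,
// in random order.
final class QuestionOrder {
    private final List<Integer> remaining = new ArrayList<>();

    QuestionOrder(int questionCount, Random random) {
        for (int i = 0; i < questionCount; i++) {
            remaining.add(i);
        }
        // One shuffle up front replaces the repeated random picks.
        Collections.shuffle(remaining, random);
    }

    boolean hasNext() {
        return !remaining.isEmpty();
    }

    int next() {
        // Removing from the end of an ArrayList is cheap and cannot repeat.
        return remaining.remove(remaining.size() - 1);
    }
}
```

In `CplusQuiz`, an instance created with `new QuestionOrder(mQuestionLength, r)` could replace each `r.nextInt(mQuestionLength)` call with `order.next()`, after checking `order.hasNext()` to detect that the quiz is finished.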

How to have multiple keys for hashmap

private void createRooms()
{
myNeighbor = new HashMap <String, Room> ();
crumbs = new Item("Crumbs", "small crumbs of some kind of food", 100);
eggs = new Item("Raw Eggs", "a couple of raw eggs still contained within their egg shells", 1100);
cellPhone = new Item("Cell Phone", "Mike's cell phone he must have forgotten here...", 0);
textBooks = new Item("Textbooks", "Jay's textbooks, because he can't use his bedroom to store his stuff", 0);
poptarts = new Item("Pop Tarts", "an un-opened box of chocolate pop tarts that someone must have left behind...", 1500);
pizzaRolls = new Item("Pizza Rolls", "cooked steaming pizza rolls piled high", 2000);
clothes = new Item("Clothes", "clothes, a lot of clothes all over the floor and all over the room, who knows if they're clean or not...", 0);
// miningTools = new Item("Mining Tools", "pickaxes, drills, and everything else you need to extract rocks and minerals from the earth's crust", 100);
chips = new Item("Chips", "chip bag hidden away that is only half full now", 400);
hallway = new Room("in a dark hallway with crumbs scattered over the ground", crumbs);
kitchen = new Room("in a kitchen with raw eggs lying on the counter tops", eggs);
bathroom = new Room("in a bathroom with a stand up shower, a washer, a drier, and Mike's cell phone left behind laying on the counter", cellPhone);
livingRoom = new Room("in a living room with Jay's textbooks all over the room", textBooks);
upstairsLobby = new Room("in a lobby at the top of the stairs with a box of pop tarts on the ground", poptarts);
blakesRoom = new Room("in a dark room with towers of pizza rolls covering the desk and scattered across the bed", pizzaRolls);
jaysRoom = new Room("in a cluttered room with clothes covering every inch of the floor and nothing hanging on the walls", clothes);
mikesRoom = new Room("in a bed room with mining tools and a bag of chips hidden underneath a pillow on the bed", chips);
hallway.addNeighbor("north", kitchen);
hallway.addNeighbor("west", upstairsLobby);
hallway.addNeighbor("east", livingRoom);
kitchen.addNeighbor("west", bathroom);
kitchen.addNeighbor("south", hallway);
bathroom.addNeighbor("east", kitchen);
livingRoom.addNeighbor("west", hallway);
upstairsLobby.addNeighbor("north", jaysRoom);
upstairsLobby.addNeighbor("west", blakesRoom);
upstairsLobby.addNeighbor("east", mikesRoom);
upstairsLobby.addNeighbor("south", hallway);
blakesRoom.addNeighbor("east", upstairsLobby);
jaysRoom.addNeighbor("south", upstairsLobby);
mikesRoom.addNeighbor("west", upstairsLobby);
}
Room class
import java.util.HashMap;
/**
* Write a description of class Room here.
*
* @author (Christopher a date)
*/
public class Room
{
private String description;
private Item item;
private HashMap <String, Room> myNeighbor;
public Room (String pDescription)
{
description = pDescription;
item = null;
HashMap <String, Room> myNeighbor = new HashMap <String, Room> ();
}
public Room (String pDescription, Item pItem)
{
description = pDescription;
item = pItem;
}
public String getDescription()
{
return description;
}
public Item getItem()
{
return item;
}
public void addItem(Item i)
{
item = i;
}
public boolean hasItem()
{
if (item != null)
return true;
else
return false;
}
public void addNeighbor(String pDirection, Room r)
{
myNeighbor = new HashMap <String, Room> ();
myNeighbor.put(pDirection, r);
}
public Room getNeighbor(String pDirection)
{
Room next = myNeighbor.get(pDirection);
if(next != null){
return next;
}
else{
return null;
}
}
public Item removeItem()
{
Item temp;
temp = item;
item = null;
return temp;
}
public String getLongDescription()
{
String part1 = "You are " + description;
String part2 = "You see ";
if(item != null){
return part1 + "" + part2 + "" + item.getDescription() + "" + item.getCalories();
}
return part1;
}
}
Long story short, the point of this is to add rooms and be able to navigate them and to pick up and drop items. It has just been brought to my attention, as I try to run the program, that I can't have multiple north/south/east/west keys. How can I get around this so I can make this work?
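It won't allow me to comment, so...
I am not sure what your Room class looks like, but I am guessing it is initialized with a HashMap in the constructor and has a method called addNeighbor that actually modifies this HashMap?
----EDIT-----
Your addNeighbor method shows that you create a new HashMap every time you add a neighbor, which throws away the neighbors added earlier. There is no need for that: create myNeighbor once in the constructor, and from then on just "put" the new key/value combination into the map.
Just remove the line that creates a new HashMap every time. Note that the one-argument constructor currently declares a local variable named myNeighbor instead of assigning the field, and the two-argument constructor does not create the map at all, so the field has to be initialized in both (or at its declaration) for this to work.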
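The fix described above can be sketched as follows. This is a reduced version of the Room class that shows only the neighbor handling (the Item-related members are left out), with the map initialized once at its declaration so that every constructor gets it.

```java
import java.util.HashMap;
import java.util.Map;

// Reduced sketch of Room: only the neighbor handling is shown.
class Room {
    private final String description;
    // Created exactly once per room, so earlier exits are never discarded.
    private final Map<String, Room> myNeighbor = new HashMap<>();

    Room(String pDescription) {
        description = pDescription;
    }

    String getDescription() {
        return description;
    }

    void addNeighbor(String pDirection, Room r) {
        // Only put(); no new HashMap here.
        myNeighbor.put(pDirection, r);
    }

    Room getNeighbor(String pDirection) {
        // Returns null when there is no exit in that direction.
        return myNeighbor.get(pDirection);
    }
}
```

Each room owns its own map, so the key "north" can be used in every room without clashing: the keys only have to be unique within one room.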
Assuming that you want to be able to write:
Room targetRoom = currentRoom.neighbour("north");
then you need to change your design.
The neighbours need to be members (instance variables) of a room, for example:
import java.util.HashMap;
import java.util.Map;
public class Room {
...
// Looks up the room in the given direction, or null if there is none.
public Room neighbour(String direction) {
return neighbours.get(direction);
}
private final Map<String, Room> neighbours = new HashMap<>();
}
(I've omitted some details inside the class, like adding a neighbour to a room.)
Now, since there are only 4 possible directions (N, S, E, W), an array of neighbours for each room would do the trick as well.
public class Room {
public Room[] neighbours = new Room[4];
...
}
Room room;
room.neighbours[NORTH] = ... ; // NORTH is an int index in the range 0..3
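The array-of-four idea can be made type-safe with an enum whose ordinal serves as the array index. This is only a sketch; `Direction` and `GridRoom` are hypothetical names chosen for the illustration.

```java
// Hypothetical names: Direction and GridRoom are illustrations only.
enum Direction { NORTH, SOUTH, EAST, WEST }

class GridRoom {
    // One slot per direction; a null slot means there is no exit that way.
    private final GridRoom[] neighbours = new GridRoom[Direction.values().length];

    void setNeighbour(Direction direction, GridRoom room) {
        neighbours[direction.ordinal()] = room;
    }

    GridRoom neighbour(Direction direction) {
        return neighbours[direction.ordinal()];
    }
}
```

Compared with string keys, a misspelled direction becomes a compile-time error instead of a silent null at run time.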

ObjectGraphBuilder from and to a file

How can I make ObjectGraphBuilder build an instance of my class from a string? For example, if I have
String myString = """invoices{
invoice(date: new Date(106,1,2)){
item(count:5){
product(name:'ULC', dollar:1499)
}
item(count:1){
product(name:'Visual Editor', dollar:499)
}
}
invoice(date: new Date(106,1,2)){
item(count:4) {
product(name:'Visual Editor', dollar:499)
}
}
"""
how can I turn this string (myString) into an instance of the invoice class? (I assume I have to use ObjectGraphBuilder, but how?)
Given an instance of the invoice class (with all of its nested properties), how can I turn that instance into a string like myString?
I also want to be able to serialize to and deserialize from a text file, but I assume that works the same way as with the string.
You can work with GroovyShell to evaluate the string and delegate the methods called in the script to an ObjectGraphBuilder. I repeated the "invoices" method. If this is unacceptable, take a look at Going to Mars with Domain-Specific Languages, by Guillaume Laforge, where he teaches how to customize the compiler.
I also created an Invoices class, because of the way ObjectGraphBuilder works. If this will be dynamic for you, take a look at its resolvers.
import groovy.transform.ToString as TS
@TS class Invoices { List<Invoice> invoices=[] }
@TS class Invoice { List<Item> items=[]; Date date }
@TS class Item { Integer count; Product product }
@TS class Product { String name; Integer dollar; Vendor vendor }
@TS class Vendor { Integer id }
String myString = """
invoices {
invoice(date: new Date(106,1,2)){
item(count:5){
product(name:'ULC', dollar:1499)
}
item(count:1){
product(name:'Visual Editor', dollar:499)
}
}
invoice(date: new Date(106,1,2)){
item(count:4) {
product(name:'Visual Editor', dollar:499)
}
}
}
"""
invoicesParser = { Closure c ->
new ObjectGraphBuilder().invoices c
}
binding = new Binding( [invoices: invoicesParser] )
invoices = new GroovyShell(binding).evaluate myString
assert invoices.invoices.size() == 2
Update: as for your second question, I'm not aware of, and could not find, any way to go from the object graph back to the builder representation. You can roll your own, but I think you would be better off with something like JSON. Does your use case permit that?
use( groovy.json.JsonOutput ) {
assert invoices.toJson().prettyPrint() == """{
"invoices": [
{
"date": "2006-02-02T02:00:00+0000",
"items": [
{
"product": {
"vendor": null,
"dollar": 1499,
"name": "ULC"
},
"count": 5
},
{
"product": {
"vendor": null,
"dollar": 499,
"name": "Visual Editor"
},
"count": 1
}
]
},
{
"date": "2006-02-02T02:00:00+0000",
"items": [
{
"product": {
"vendor": null,
"dollar": 499,
"name": "Visual Editor"
},
"count": 4
}
]
}
]
}"""
}

Multi-term named entities in Stanford Named Entity Recognizer

I'm using the Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml) and it's working fine. This is my code:
List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
for (CoreLabel word : sentence) {
if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
namedEntities.add(word.word().trim());
}
}
}
However the problem I'm finding is identifying names and surnames. If the recognizer encounters "Joe Smith", it is returning "Joe" and "Smith" separately. I'd really like it to return "Joe Smith" as one term.
Could this be achieved through the recognizer maybe through a configuration? I didn't find anything in the javadoc till now.
Thanks!
This is because your inner for loop is iterating over individual tokens (words) and adding them separately. You need to change things to add whole names at once.
One way is to replace the inner for loop with a regular for loop with a while loop inside it which takes adjacent non-O things of the same class and adds them as a single entity.*
Another way would be to use the CRFClassifier method call:
List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)
which will give you whole entities, which you can extract the String form of by using substring on the original input.
*The models that we distribute use a simple raw IO label scheme, where things are labeled PERSON or LOCATION, and the appropriate thing to do is simply to coalesce adjacent tokens with the same label. Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012).
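The coalescing step described in the footnote can be sketched without the Stanford classes. The sketch assumes the tokens and their labels have already been extracted into two parallel arrays, with "O" as the background label; `EntityCoalescer` and `coalesce` are hypothetical names, and the "LABEL:text" output format is chosen only for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Library-free sketch: words[i] carries labels[i], "O" marks non-entities.
final class EntityCoalescer {
    static List<String> coalesce(String[] words, String[] labels) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentLabel = "O";
        for (int i = 0; i < words.length; i++) {
            if (!labels[i].equals(currentLabel)) {
                // Label changed: close the entity in progress, if any.
                if (!currentLabel.equals("O")) {
                    entities.add(currentLabel + ":" + current);
                }
                current.setLength(0);
                currentLabel = labels[i];
            }
            if (!currentLabel.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words[i]);
            }
        }
        // Flush an entity that runs to the end of the sentence.
        if (!currentLabel.equals("O")) {
            entities.add(currentLabel + ":" + current);
        }
        return entities;
    }
}
```

With the classifier from the question, the two arrays would be filled from `word.word()` and `word.get(AnswerAnnotation.class)` for each CoreLabel in a sentence.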
The downside of the classifyToCharacterOffsets method is that (AFAIK) you can't access the label of the entities.
As proposed by Christopher, here is an example of a loop which assembles "adjacent non-O things". This example also counts the number of occurrences.
public HashMap<String, HashMap<String, Integer>> extractEntities(String text){
HashMap<String, HashMap<String, Integer>> entities =
new HashMap<String, HashMap<String, Integer>>();
for (List<CoreLabel> lcl : classifier.classify(text)) {
Iterator<CoreLabel> iterator = lcl.iterator();
if (!iterator.hasNext())
continue;
CoreLabel cl = iterator.next();
while (iterator.hasNext()) {
String answer =
cl.getString(CoreAnnotations.AnswerAnnotation.class);
if (answer.equals("O")) {
cl = iterator.next();
continue;
}
if (!entities.containsKey(answer))
entities.put(answer, new HashMap<String, Integer>());
String value = cl.getString(CoreAnnotations.ValueAnnotation.class);
while (iterator.hasNext()) {
cl = iterator.next();
if (answer.equals(
cl.getString(CoreAnnotations.AnswerAnnotation.class)))
value = value + " " +
cl.getString(CoreAnnotations.ValueAnnotation.class);
else {
if (!entities.get(answer).containsKey(value))
entities.get(answer).put(value, 0);
entities.get(answer).put(value,
entities.get(answer).get(value) + 1);
break;
}
}
if (!iterator.hasNext())
break;
}
}
return entities;
}
I had the same problem, so I looked it up, too. The method proposed by Christopher Manning is efficient, but the delicate point is to know how to decide which kind of separator is appropriate. One could say only a space should be allowed, e.g. "John Zorn" >> one entity. However, I may find the form "J.Zorn", so I should also allow certain punctuation marks. But what about "Jack, James and Joe" ? I might get 2 entities instead of 3 ("Jack James" and "Joe").
By digging a bit into the Stanford NER classes, I actually found a proper implementation of this idea. They use it to export entities in the form of single String objects. For instance, in the method PlainTextDocumentReaderAndWriter.printAnswersInlineXML, we have:
private void printAnswersInlineXML(List<IN> doc, PrintWriter out) {
final String background = flags.backgroundSymbol;
String prevTag = background;
for (Iterator<IN> wordIter = doc.iterator(); wordIter.hasNext();) {
IN wi = wordIter.next();
String tag = StringUtils.getNotNullString(wi.get(AnswerAnnotation.class));
String before = StringUtils.getNotNullString(wi.get(BeforeAnnotation.class));
String current = StringUtils.getNotNullString(wi.get(CoreAnnotations.OriginalTextAnnotation.class));
if (!tag.equals(prevTag)) {
if (!prevTag.equals(background) && !tag.equals(background)) {
out.print("</");
out.print(prevTag);
out.print('>');
out.print(before);
out.print('<');
out.print(tag);
out.print('>');
} else if (!prevTag.equals(background)) {
out.print("</");
out.print(prevTag);
out.print('>');
out.print(before);
} else if (!tag.equals(background)) {
out.print(before);
out.print('<');
out.print(tag);
out.print('>');
}
} else {
out.print(before);
}
out.print(current);
String afterWS = StringUtils.getNotNullString(wi.get(AfterAnnotation.class));
if (!tag.equals(background) && !wordIter.hasNext()) {
out.print("</");
out.print(tag);
out.print('>');
prevTag = background;
} else {
prevTag = tag;
}
out.print(afterWS);
}
}
They iterate over each word, checking whether it has the same class (answer) as the previous one, as explained before. For this, they take advantage of the fact that expressions not considered to be entities are flagged with the so-called backgroundSymbol (class "O"). They also use the BeforeAnnotation property, which represents the string separating the current word from the previous one. This last point solves the problem I initially raised regarding the choice of an appropriate separator.
Code for the above:
List<Triple<String, Integer, Integer>> result = classifier.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> triple : result)
{
System.out.println(triple.first + " : " + text.substring(triple.second, triple.third));
}
List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
String s = "";
String prevLabel = null;
for (CoreLabel word : sentence) {
if(prevLabel == null || prevLabel.equals(word.get(CoreAnnotations.AnswerAnnotation.class)) ) {
s = s + " " + word;
prevLabel = word.get(CoreAnnotations.AnswerAnnotation.class);
}
else {
if(!prevLabel.equals("O"))
System.out.println(s.trim() + '/' + prevLabel + ' ');
s = " " + word;
prevLabel = word.get(CoreAnnotations.AnswerAnnotation.class);
}
}
if(!prevLabel.equals("O"))
System.out.println(s + '/' + prevLabel + ' ');
}
I just wrote a small piece of logic and it's working fine. What I did is group words with the same label if they are adjacent.
Make use of the classifiers already provided to you. I believe this is what you are looking for:
private static String combineNERSequence(String text) {
String serializedClassifier = "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz";
AbstractSequenceClassifier<CoreLabel> classifier = null;
try {
classifier = CRFClassifier
.getClassifier(serializedClassifier);
} catch (ClassCastException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println(classifier.classifyWithInlineXML(text));
// FOR TSV FORMAT //
//System.out.print(classifier.classifyToString(text, "tsv", false));
return classifier.classifyWithInlineXML(text);
}
Here is my full code. I use Stanford CoreNLP and wrote an algorithm to concatenate multi-term names.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import org.apache.log4j.Logger;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
/**
* Created by Chanuka on 8/28/14 AD.
*/
public class FindNameEntityTypeExecutor {
private static Logger logger = Logger.getLogger(FindNameEntityTypeExecutor.class);
private StanfordCoreNLP pipeline;
public FindNameEntityTypeExecutor() {
logger.info("Initializing Annotator pipeline ...");
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
pipeline = new StanfordCoreNLP(props);
logger.info("Annotator pipeline initialized");
}
List<String> findNameEntityType(String text, String entity) {
logger.info("Finding entity type matches in the " + text + " for entity type, " + entity);
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
List<String> matches = new ArrayList<String>();
for (CoreMap sentence : sentences) {
int previousCount = 0;
int count = 0;
// traversing the words in the current sentence
// a CoreLabel is a CoreMap with additional token-specific methods
for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
String word = token.get(CoreAnnotations.TextAnnotation.class);
int previousWordIndex;
if (entity.equals(token.get(CoreAnnotations.NamedEntityTagAnnotation.class))) {
count++;
if (previousCount != 0 && (previousCount + 1) == count) {
previousWordIndex = matches.size() - 1;
String previousWord = matches.get(previousWordIndex);
matches.remove(previousWordIndex);
previousWord = previousWord.concat(" " + word);
matches.add(previousWordIndex, previousWord);
} else {
matches.add(word);
}
previousCount = count;
}
else
{
count=0;
previousCount=0;
}
}
}
return matches;
}
}
Another approach to dealing with multi-word entities.
This code combines multiple tokens if they have the same annotation and appear consecutively.
Restriction:
If the same token has two different annotations, the last one will be saved.
private Document getEntities(String fullText) {
Document entitiesList = new Document();
NERClassifierCombiner nerCombClassifier = loadNERClassifiers();
if (nerCombClassifier != null) {
List<List<CoreLabel>> results = nerCombClassifier.classify(fullText);
for (List<CoreLabel> coreLabels : results) {
String prevLabel = null;
String prevToken = null;
for (CoreLabel coreLabel : coreLabels) {
String word = coreLabel.word();
String annotation = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);
if (!"O".equals(annotation)) {
if (prevLabel == null) {
prevLabel = annotation;
prevToken = word;
} else {
if (prevLabel.equals(annotation)) {
prevToken += " " + word;
} else {
prevLabel = annotation;
prevToken = word;
}
}
} else {
if (prevLabel != null) {
entitiesList.put(prevToken, prevLabel);
prevLabel = null;
}
}
}
}
}
return entitiesList;
}
Imports:
Document: org.bson.Document;
NERClassifierCombiner: edu.stanford.nlp.ie.NERClassifierCombiner;

Resources