Limitation in Cassandra-0.8.1 when using batch mutation

I'm getting exceptions from Cassandra when I do a batch mutation. It says "already has modifications in this mutation", but the info given shows two different operations.
I use a super column family with counters in this case. It looks like:
Key: MD5 of the URL, UTF-8
SuperColumnName: date, UTF-8
ColumnName: counter name, a random number from 1 to 200
ColumnValue: 1L
public void SuperCounterMutation(ArrayList<String> urlList) {
    LinkedList<HCounterSuperColumn<String, String>> counterSuperColumns;
    for (String line : urlList) {
        String[] ele = StringUtils.split(StringUtils.strip(line), ':');
        String key = ele[0];
        String SuperColumnName = ele[1];
        LinkedList<HCounterColumn<String>> ColumnList = new LinkedList<HCounterColumn<String>>();
        for (int i = 2; i < ele.length; ++i) {
            ColumnList.add(HFactory.createCounterColumn(ele[i], 1L, ser));
        }
        mutator.addCounter(key, ColumnFamilyName, HFactory.createCounterSuperColumn(SuperColumnName, ColumnList, ser, ser));
        ++count;
        if (count >= BUF_MAX_NUM) {
            try {
                mutator.execute();
            } catch (Exception e) {
                e.printStackTrace();
            }
            mutator = HFactory.createMutator(keyspace, ser);
            count = 0;
        }
    }
    return;
}
The error info in the Cassandra log shows that the "duplicated" operations share only the same key: the SuperColumnNames are different, and the sets of counter names sometimes intersect and sometimes don't.
I'm using Cassandra 0.8.1 with Hector 0.8.0-rc2.
Can anyone tell me the reason for this problem? Thanks in advance!

Error info from cassandra log showed that the duplicated operations have the same key
Bingo. You'll need to combine operations from the same key into a single mutation.
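For example, with the Hector API already used in the question you can group the counter columns by row key (and by super column) first, so that each key contributes a single set of operations per executed batch. This is a minimal sketch reusing the question's fields (keyspace, ColumnFamilyName, ser); the grouping strategy is just one reading of the advice above, and the BUF_MAX_NUM flushing is left out for brevity:
// Group counter columns by row key and super column name before mutating,
// so a given key shows up in the batch only once (java.util collections assumed).
public void superCounterMutationGrouped(ArrayList<String> urlList) {
    Map<String, Map<String, LinkedList<HCounterColumn<String>>>> grouped =
            new HashMap<String, Map<String, LinkedList<HCounterColumn<String>>>>();
    for (String line : urlList) {
        String[] ele = StringUtils.split(StringUtils.strip(line), ':');
        String key = ele[0];
        String superColumnName = ele[1];
        Map<String, LinkedList<HCounterColumn<String>>> superColumns = grouped.get(key);
        if (superColumns == null) {
            superColumns = new HashMap<String, LinkedList<HCounterColumn<String>>>();
            grouped.put(key, superColumns);
        }
        LinkedList<HCounterColumn<String>> columns = superColumns.get(superColumnName);
        if (columns == null) {
            columns = new LinkedList<HCounterColumn<String>>();
            superColumns.put(superColumnName, columns);
        }
        for (int i = 2; i < ele.length; ++i) {
            columns.add(HFactory.createCounterColumn(ele[i], 1L, ser));
        }
    }
    // One addCounter call per (key, super column), then a single execute.
    Mutator<String> mutator = HFactory.createMutator(keyspace, ser);
    for (Map.Entry<String, Map<String, LinkedList<HCounterColumn<String>>>> keyEntry : grouped.entrySet()) {
        for (Map.Entry<String, LinkedList<HCounterColumn<String>>> scEntry : keyEntry.getValue().entrySet()) {
            mutator.addCounter(keyEntry.getKey(), ColumnFamilyName,
                    HFactory.createCounterSuperColumn(scEntry.getKey(), scEntry.getValue(), ser, ser));
        }
    }
    try {
        mutator.execute();
    } catch (Exception e) {
        e.printStackTrace();
    }
}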

Related

Datastax Java Driver failed to scan an entire table

I iterated over the entire table and received fewer partitions than expected.
Initially I thought something must be wrong on my end, but after checking the existence of every row with a simple WHERE query (I have the list of billions of keys that I used), and also verifying the expected count with the Spark connector, I concluded that it can't be anything other than the driver.
I have billions of data rows, yet I receive half a billion fewer.
Has anyone else encountered this issue and been able to resolve it?
Adding a code snippet.
The structure of the table is a simple counter table:
CREATE TABLE counter_data (
    id text,
    name text,
    count_val counter,
    PRIMARY KEY (id, name)
);
public class CountTable {
    private Session session;
    private Statement countQuery;

    public void initSession(String table) {
        QueryOptions queryOptions = new QueryOptions();
        queryOptions.setConsistencyLevel(ConsistencyLevel.ONE);
        queryOptions.setFetchSize(100);
        QueryLogger queryLogger = QueryLogger.builder().build();
        Cluster cluster = Cluster.builder().addContactPoints("ip").withPort(9042)
                .build();
        cluster.register(queryLogger);
        this.session = cluster.connect("ks");
        this.countQuery = QueryBuilder.select("id").from(table);
    }

    public void performCount() {
        ResultSet results = session.execute(countQuery);
        int count = 0;
        String lastKey = "";
        results.iterator();
        for (Row row : results) {
            String key = row.getString(0);
            if (!key.equals(lastKey)) {
                lastKey = key;
                count++;
            }
        }
        session.close();
        System.out.println("count is " + count);
    }

    public static void main(String[] args) {
        CountTable countTable = new CountTable();
        countTable.initSession("counter_data");
        countTable.performCount();
    }
}
Upon checking your code, the consistency level requested is ONE, which is comparable to a dirty read in the RDBMS world.
queryOptions.setConsistencyLevel(ConsistencyLevel.ONE);
For stronger consistency, that is, to get back all records, use LOCAL_QUORUM. Update your code as follows:
queryOptions.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
LOCAL_QUORUM guarantees that a majority of the replicas (in your case 2 out of 3) respond to the read request, giving stronger consistency and hence an accurate number of rows. Here is the documentation reference on consistency.
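One thing to double-check alongside the consistency change: in the snippet above the QueryOptions object is built but never handed to the Cluster, so neither the consistency level nor the fetch size is actually applied. A minimal sketch of wiring it in with the DataStax Java driver (same placeholder contact point as the question):
QueryOptions queryOptions = new QueryOptions();
queryOptions.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
queryOptions.setFetchSize(100);
Cluster cluster = Cluster.builder()
        .addContactPoints("ip")
        .withPort(9042)
        .withQueryOptions(queryOptions) // without this, the options above are never used
        .build();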

Saving DataSet<Row> to Ignite

Here is my code
public static void save(IgniteContext igniteContext, String cacheName, Dataset<Row> dataSet) {
    CacheConfiguration<BinaryObject, BinaryObject> cacheConfiguration = new CacheConfiguration<BinaryObject, BinaryObject>(cacheName)
            .setAtomicityMode(CacheAtomicityMode.ATOMIC)
            .setBackups(0)
            .setAffinity(new RendezvousAffinityFunction(false, 2))
            .setIndexedTypes(BinaryObject.class, BinaryObject.class);
    IgniteCache<BinaryObject, BinaryObject> rddCache = igniteContext.ignite()
            .getOrCreateCache(cacheConfiguration)
            .withKeepBinary();
    rddCache.clear();
    IgniteRDD<BinaryObject, BinaryObject> igniteRDD = igniteContext.fromCache(cacheName);
    StructField[] fields = dataSet.schema().fields();
    RDD<BinaryObject> binaryObjectJavaRDD = dataSet.toJavaRDD().map(row -> {
        BinaryObjectBuilder valueBuilder = igniteContext.ignite().binary().builder(BinaryObject.class.getCanonicalName());
        for (int i = 0; i < fields.length; i++) {
            valueBuilder.setField(fields[i].name(), convertValue(String.valueOf(row.get(i)), fields[i].dataType())); // convertValue converts the value to a specific datatype
        }
        return valueBuilder.build();
    }).rdd();
    igniteRDD.saveValues(binaryObjectJavaRDD);
}
I have a problem with the above code: even after this method completes successfully, the cache remains empty. The Dataset has 20 rows, so that is not the problem.
The other problem is that if I use the savePairs method from IgniteRDD, then I have to generate the key myself (here the key is a BinaryObject), so how do I do that?
Update:
void saveDFInPairs(IgniteContext igniteContext, Dataset<Row> dataSet, IgniteRDD<BinaryObject, BinaryObject> igniteRDD) {
    StructField[] fields = dataSet.schema().fields();
    JavaRDD<Tuple2<BinaryObject, BinaryObject>> rdd = dataSet.toJavaRDD().map(row -> {
        BinaryObjectBuilder keyBuilder = igniteContext.ignite()
                .binary().builder("TypeName");
        keyBuilder.setField("id", row.mkString().hashCode());
        BinaryObject key = keyBuilder.build();
        BinaryObjectBuilder valueBuilder = igniteContext.ignite()
                .binary().builder("TypeName");
        for (int i = 0; i < fields.length; i++) {
            valueBuilder.setField(fields[i].name(), convert(row, i, fields[i].dataType()));
        }
        BinaryObject value = valueBuilder.build();
        return new Tuple2<>(key, value);
    });
    igniteRDD.savePairs(rdd.rdd(), true);
}
A couple of considerations:
The type name (the one passed to the builder() method) should be a meaningful name representing the data type. Do not use the BinaryObject class name for this.
setIndexedTypes(BinaryObject.class, BinaryObject.class) is incorrect. This should specify classes to be processed for query annotations. If you don't have classes, you can use QueryEntity to configure queries (see the sketch after these notes). See this page for further details: https://apacheignite.readme.io/docs/sql-queries
Other than that, the code looks correct. I would recommend trying with default settings first and checking whether it works that way. Also, it's not very clear how you check whether the data is in the cache or not.
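A rough sketch of those two points together; "MyKeyType", "MyValueType", and the field list are placeholder names, not something taken from the original code:
CacheConfiguration<BinaryObject, BinaryObject> cacheConfiguration =
        new CacheConfiguration<BinaryObject, BinaryObject>(cacheName)
                .setAtomicityMode(CacheAtomicityMode.ATOMIC)
                .setBackups(0);
// Describe the queryable fields via QueryEntity instead of setIndexedTypes.
QueryEntity queryEntity = new QueryEntity("MyKeyType", "MyValueType");
LinkedHashMap<String, String> queryFields = new LinkedHashMap<String, String>();
queryFields.put("id", Integer.class.getName());
queryFields.put("name", String.class.getName());
queryEntity.setFields(queryFields);
cacheConfiguration.setQueryEntities(Collections.singletonList(queryEntity));
IgniteCache<BinaryObject, BinaryObject> cache = igniteContext.ignite()
        .getOrCreateCache(cacheConfiguration)
        .withKeepBinary();
// Build the binary objects with the same meaningful type name, e.g.:
// BinaryObjectBuilder valueBuilder = igniteContext.ignite().binary().builder("MyValueType");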

Azure Batch Insert: Bad Request Error

I am getting the below error while trying to insert multiple entities into Azure Table storage:
com.microsoft.azure.storage.table.TableServiceException: Bad Request
at com.microsoft.azure.storage.table.TableBatchOperation$1.postProcessResponse(TableBatchOperation.java:525)
at com.microsoft.azure.storage.table.TableBatchOperation$1.postProcessResponse(TableBatchOperation.java:433)
at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:146)
Below is the Java code for batch insert:
public BatchInsertResponse batchInsert(BatchInsertRequest request) {
    BatchInsertResponse response = new BatchInsertResponse();
    String erpName = request.getErpName();
    HashMap<String, List<TableEntity>> tableNameToEntityMap = request.getTableNameToEntityMap();
    HashMap<String, List<TableEntity>> errorMap = new HashMap<String, List<TableEntity>>();
    HashMap<String, List<TableEntity>> successMap = new HashMap<String, List<TableEntity>>();
    CloudTable cloudTable = null;
    for (Map.Entry<String, List<TableEntity>> entry : tableNameToEntityMap.entrySet()) {
        try {
            cloudTable = azureStorage.getTable(entry.getKey());
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Define a batch operation.
        TableBatchOperation batchOperation = new TableBatchOperation();
        List<TableEntity> value = entry.getValue();
        for (int i = 0; i < value.size(); i++) {
            TableEntity entity = value.get(i);
            batchOperation.insertOrReplace(entity);
            if (i != 0 && i % batchSize == 0) {
                try {
                    cloudTable.execute(batchOperation);
                    batchOperation.clear();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
        try {
            cloudTable.execute(batchOperation);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return response;
}
The above code works fine if I set batchSize to 10, but if I set it to 1000 or 100 it throws the Bad Request error.
Please help me resolve this error. I am using Spring Boot and Azure Storage Java SDK version 4.3.0.
As Aravind mentioned, a 400 error usually means there's something wrong with your data. From this link, an entity batch transaction will fail if one or more of the following conditions is not met:
All entities subject to operations as part of the transaction must have the same PartitionKey value.
An entity can appear only once in the transaction, and only one operation may be performed against it.
The transaction can include at most 100 entities, and its total payload may be no more than 4 MB in size.
All entities are subject to the limitations described in Understanding the Table Service Data Model.
Please check your entities against these four rules and ensure that you're not violating any of them.
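A sketch of a batching loop that respects those limits (not the poster's exact code; insertInBatches and MAX_BATCH are made-up names): group the entities by PartitionKey and flush at most 100 operations per execute().
private static final int MAX_BATCH = 100; // hard limit of the Table service

private void insertInBatches(CloudTable cloudTable, List<TableEntity> entities) throws Exception {
    // Entities in one batch must share the same PartitionKey, so group them first.
    Map<String, List<TableEntity>> byPartition = new HashMap<>();
    for (TableEntity e : entities) {
        byPartition.computeIfAbsent(e.getPartitionKey(), k -> new ArrayList<>()).add(e);
    }
    for (List<TableEntity> partition : byPartition.values()) {
        TableBatchOperation batch = new TableBatchOperation();
        for (TableEntity e : partition) {
            batch.insertOrReplace(e);
            if (batch.size() == MAX_BATCH) {
                cloudTable.execute(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            cloudTable.execute(batch);
        }
    }
}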

In J2ME, How to re-index records in recordstore after deleting any record

I am developing a location-based J2ME app, and in it I'm using RMS to store data.
When I delete any record in a RecordStore, the remaining records don't get re-indexed. For example, if I have 5 records and I delete record no. 2, the record ids will be {1, 3, 4, 5}. But I want the record ids after deletion to be {1, 2, 3, 4}. How can I do this? The recordId plays an important role in my app for retrieving and updating records.
You need to change your application logic. The ID is just for identification, not for sorting. Because it is for identification, it must remain the same.
Very often the easiest thing to do is to read and write the whole recordstore at once.
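For example, a minimal sketch of that approach (it assumes a store named "mystore" that is small enough to fit in memory; after the rewrite the RMS record ids start again from 1, though note that without a RecordComparator the enumeration order is undefined):
private void deleteAndCompact(int recordIdToDelete) throws RecordStoreException {
    RecordStore rs = RecordStore.openRecordStore("mystore", false);
    Vector kept = new Vector();
    // read every record except the one being deleted
    RecordEnumeration re = rs.enumerateRecords(null, null, false);
    while (re.hasNextElement()) {
        int id = re.nextRecordId();
        if (id != recordIdToDelete) {
            kept.addElement(rs.getRecord(id));
        }
    }
    rs.closeRecordStore();
    RecordStore.deleteRecordStore("mystore");          // drop the old store and its ids
    rs = RecordStore.openRecordStore("mystore", true); // recreate; ids restart at 1
    for (int i = 0; i < kept.size(); i++) {
        byte[] record = (byte[]) kept.elementAt(i);
        rs.addRecord(record, 0, record.length);
    }
    rs.closeRecordStore();
}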
So, since you've said that your record store is basically small (not that much data), I would recommend simply adding your own custom id field to each record. As Meier said, the RMS record id is not really meant to be recalculated and changed once a record has been created. So, I would use your own.
If each of your records contains:
boolean isMale
int age
String firstName
then, I would simply add another field at the start of each record:
int id
It makes your records a little bigger, but not by much (4 bytes/record). If you'll have fewer than 64k records, you could also use a short for the id and save a couple of bytes.
Here's an example (adapted from this IBM tutorial) of reading, writing, and deleting with this kind of record:
private RecordStore _rs;

// These next two methods are just small optimizations, to allow reading and
// updating the ID field in a record without the overhead of creating a new
// stream to call readInt() on. This assumes the id is a 4-byte int, written
// as the first field in each record.

/** Update one record with a new id field */
private static final void putIdIntoRecord(int id, byte[] record) {
    // we assume the first 4 bytes are the id (int)
    record[0] = (byte) (id >> 24);
    record[1] = (byte) (id >> 16);
    record[2] = (byte) (id >> 8);
    record[3] = (byte) id;
}

/** Get the id field from one record */
private static final int getIdFromRecord(byte[] record) {
    // we assume the first 4 bytes are the id (int)
    return ((0xFF & record[0]) << 24) |
           ((0xFF & record[1]) << 16) |
           ((0xFF & record[2]) << 8) |
           (0xFF & record[3]);
}

/** Delete a record with the given (custom) id, re-indexing records afterwards */
private void delete(int idToDelete) {
    try {
        RecordEnumeration enumerator = _rs.enumerateRecords(new IdEqualToFilter(idToDelete),
                null, false);
        _rs.deleteRecord(enumerator.nextRecordId());
        // now, re-index records after 'idToDelete'
        enumerator = _rs.enumerateRecords(new IdGreaterThanFilter(idToDelete), null, true);
        while (enumerator.hasNextElement()) {
            int recordIdToUpdate = enumerator.nextRecordId();
            byte[] record = _rs.getRecord(recordIdToUpdate);
            // decrement the id by 1
            int newId = getIdFromRecord(record) - 1;
            // copy the new id back into the record
            putIdIntoRecord(newId, record);
            // update the record, which now has a lower id, in the store
            _rs.setRecord(recordIdToUpdate, record, 0, record.length);
        }
    } catch (RecordStoreNotOpenException e) {
        e.printStackTrace();
    } catch (InvalidRecordIDException e) {
        e.printStackTrace();
    } catch (RecordStoreException e) {
        e.printStackTrace();
    }
}
/** generate some record store data ... example of writing to the store */
public void writeTestData()
{
    // just put 20 random records into the record store
    boolean[] booleans = new boolean[20];
    int[] integers = new int[20];
    String[] strings = new String[20];
    for (int i = 0; i < 20; i++) {
        booleans[i] = (i % 2 == 1);
        integers[i] = i * 2;
        strings[i] = "string-" + i;
    }
    writeRecords(booleans, integers, strings);
}

/** take the supplied arrays of data, and save a record for each array index */
public void writeRecords(boolean[] bData, int[] iData, String[] sData)
{
    try
    {
        // Write data into an internal byte array
        ByteArrayOutputStream strmBytes = new ByteArrayOutputStream();
        // Write Java data types into the above byte array
        DataOutputStream strmDataType = new DataOutputStream(strmBytes);
        byte[] record;
        for (int i = 0; i < sData.length; i++)
        {
            // Write Java data types
            strmDataType.writeInt(i); // this will be the ID field!
            strmDataType.writeBoolean(bData[i]);
            strmDataType.writeInt(iData[i]);
            strmDataType.writeUTF(sData[i]);
            // Clear any buffered data
            strmDataType.flush();
            // Get stream data into byte array and write record
            record = strmBytes.toByteArray();
            _rs.addRecord(record, 0, record.length);
            // Toss any data in the internal array so writes
            // start at the beginning (of the internal array)
            strmBytes.reset();
        }
        strmBytes.close();
        strmDataType.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
/** read in all the records, and print them out */
public void readRecords()
{
    try
    {
        RecordEnumeration re = _rs.enumerateRecords(null, null, false);
        while (re.hasNextElement())
        {
            // Get next record
            byte[] recData = re.nextRecord();
            // Read from the specified byte array
            ByteArrayInputStream strmBytes = new ByteArrayInputStream(recData);
            // Read Java data types from the above byte array
            DataInputStream strmDataType = new DataInputStream(strmBytes);
            // Read back the data types
            System.out.println("Record ID=" + strmDataType.readInt());
            System.out.println("Boolean: " + strmDataType.readBoolean());
            System.out.println("Integer: " + strmDataType.readInt());
            System.out.println("String: " + strmDataType.readUTF());
            System.out.println("--------------------");
            strmBytes.close();
            strmDataType.close();
        }
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
Here, I make use of a couple small RecordFilter classes, to use when searching the record store:
/** helps filter out records greater than a certain id */
private class IdGreaterThanFilter implements RecordFilter {
    private int _minimumId;

    public IdGreaterThanFilter(int value) {
        _minimumId = value;
    }

    public boolean matches(byte[] candidate) {
        // return true if the candidate record's id is greater than the minimum value
        return (getIdFromRecord(candidate) > _minimumId);
    }
}

/** helps filter out records by id field (not "recordId"!) */
private class IdEqualToFilter implements RecordFilter {
    private int _id;

    public IdEqualToFilter(int value) {
        _id = value;
    }

    public boolean matches(byte[] candidate) {
        // return true if the candidate record's id matches
        return (getIdFromRecord(candidate) == _id);
    }
}

Multi-term named entities in Stanford Named Entity Recognizer

I'm using the Stanford Named Entity Recognizer http://nlp.stanford.edu/software/CRF-NER.shtml and it's working fine. This is my code:
List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
    for (CoreLabel word : sentence) {
        if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
            namedEntities.add(word.word().trim());
        }
    }
}
However, the problem I'm finding is identifying first names and surnames. If the recognizer encounters "Joe Smith", it returns "Joe" and "Smith" separately. I'd really like it to return "Joe Smith" as one term.
Could this be achieved through the recognizer, maybe through a configuration? I haven't found anything in the javadoc so far.
Thanks!
This is because your inner for loop is iterating over individual tokens (words) and adding them separately. You need to change things to add whole names at once.
One way is to replace the inner for loop with a regular for loop with a while loop inside it which takes adjacent non-O things of the same class and adds them as a single entity.*
Another way would be to use the CRFClassifier method call:
List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)
which will give you whole entities, which you can extract the String form of by using substring on the original input.
*The models that we distribute use a simple raw IO label scheme, where things are labeled PERSON or LOCATION, and the appropriate thing to do is simply to coalesce adjacent tokens with the same label. Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012).
The drawback of the classifyToCharacterOffsets method is that (AFAIK) you can't access the label of the entities.
As proposed by Christopher, here is an example of a loop which assembles "adjacent non-O things". This example also counts the number of occurrences.
public HashMap<String, HashMap<String, Integer>> extractEntities(String text) {
    HashMap<String, HashMap<String, Integer>> entities =
            new HashMap<String, HashMap<String, Integer>>();
    for (List<CoreLabel> lcl : classifier.classify(text)) {
        Iterator<CoreLabel> iterator = lcl.iterator();
        if (!iterator.hasNext())
            continue;
        CoreLabel cl = iterator.next();
        while (iterator.hasNext()) {
            String answer =
                    cl.getString(CoreAnnotations.AnswerAnnotation.class);
            if (answer.equals("O")) {
                cl = iterator.next();
                continue;
            }
            if (!entities.containsKey(answer))
                entities.put(answer, new HashMap<String, Integer>());
            String value = cl.getString(CoreAnnotations.ValueAnnotation.class);
            while (iterator.hasNext()) {
                cl = iterator.next();
                if (answer.equals(
                        cl.getString(CoreAnnotations.AnswerAnnotation.class)))
                    value = value + " " +
                            cl.getString(CoreAnnotations.ValueAnnotation.class);
                else {
                    if (!entities.get(answer).containsKey(value))
                        entities.get(answer).put(value, 0);
                    entities.get(answer).put(value,
                            entities.get(answer).get(value) + 1);
                    break;
                }
            }
            if (!iterator.hasNext())
                break;
        }
    }
    return entities;
}
I had the same problem, so I looked it up, too. The method proposed by Christopher Manning is efficient, but the delicate point is knowing how to decide which kind of separator is appropriate. One could say only a space should be allowed, e.g. "John Zorn" >> one entity. However, I may find the form "J.Zorn", so I should also allow certain punctuation marks. But what about "Jack, James and Joe"? I might get 2 entities instead of 3 ("Jack James" and "Joe").
By digging a bit into the Stanford NER classes, I actually found a proper implementation of this idea. They use it to export entities in the form of single String objects. For instance, in the method PlainTextDocumentReaderAndWriter.printAnswersTokenizedInlineXML, we have:
private void printAnswersInlineXML(List<IN> doc, PrintWriter out) {
    final String background = flags.backgroundSymbol;
    String prevTag = background;
    for (Iterator<IN> wordIter = doc.iterator(); wordIter.hasNext();) {
        IN wi = wordIter.next();
        String tag = StringUtils.getNotNullString(wi.get(AnswerAnnotation.class));
        String before = StringUtils.getNotNullString(wi.get(BeforeAnnotation.class));
        String current = StringUtils.getNotNullString(wi.get(CoreAnnotations.OriginalTextAnnotation.class));
        if (!tag.equals(prevTag)) {
            if (!prevTag.equals(background) && !tag.equals(background)) {
                out.print("</");
                out.print(prevTag);
                out.print('>');
                out.print(before);
                out.print('<');
                out.print(tag);
                out.print('>');
            } else if (!prevTag.equals(background)) {
                out.print("</");
                out.print(prevTag);
                out.print('>');
                out.print(before);
            } else if (!tag.equals(background)) {
                out.print(before);
                out.print('<');
                out.print(tag);
                out.print('>');
            }
        } else {
            out.print(before);
        }
        out.print(current);
        String afterWS = StringUtils.getNotNullString(wi.get(AfterAnnotation.class));
        if (!tag.equals(background) && !wordIter.hasNext()) {
            out.print("</");
            out.print(tag);
            out.print('>');
            prevTag = background;
        } else {
            prevTag = tag;
        }
        out.print(afterWS);
    }
}
They iterate over each word, checking whether it has the same class (answer) as the previous one, as explained before. For this, they take advantage of the fact that expressions considered not to be entities are flagged using the so-called backgroundSymbol (class "O"). They also use the property BeforeAnnotation, which represents the string separating the current word from the previous one. This last point solves the problem I initially raised, regarding the choice of an appropriate separator.
Code for the above:
List<Triple<String, Integer, Integer>> result = classifier.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> triple : result) {
    System.out.println(triple.first + " : " + text.substring(triple.second, triple.third));
}
List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
    String s = "";
    String prevLabel = null;
    for (CoreLabel word : sentence) {
        if (prevLabel == null || prevLabel.equals(word.get(CoreAnnotations.AnswerAnnotation.class))) {
            s = s + " " + word;
            prevLabel = word.get(CoreAnnotations.AnswerAnnotation.class);
        } else {
            if (!prevLabel.equals("O"))
                System.out.println(s.trim() + '/' + prevLabel + ' ');
            s = " " + word;
            prevLabel = word.get(CoreAnnotations.AnswerAnnotation.class);
        }
    }
    if (!prevLabel.equals("O"))
        System.out.println(s + '/' + prevLabel + ' ');
}
I just wrote some small logic and it's working fine. What I did was to group adjacent words that have the same label.
Make use of the classifiers already provided to you. I believe this is what you are looking for:
private static String combineNERSequence(String text) {
    String serializedClassifier = "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz";
    AbstractSequenceClassifier<CoreLabel> classifier = null;
    try {
        classifier = CRFClassifier
                .getClassifier(serializedClassifier);
    } catch (ClassCastException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    System.out.println(classifier.classifyWithInlineXML(text));
    // FOR TSV FORMAT //
    // System.out.print(classifier.classifyToString(text, "tsv", false));
    return classifier.classifyWithInlineXML(text);
}
Here is my full code. I use Stanford CoreNLP and wrote an algorithm to concatenate multi-term names.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import org.apache.log4j.Logger;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/**
 * Created by Chanuka on 8/28/14 AD.
 */
public class FindNameEntityTypeExecutor {
    private static Logger logger = Logger.getLogger(FindNameEntityTypeExecutor.class);
    private StanfordCoreNLP pipeline;

    public FindNameEntityTypeExecutor() {
        logger.info("Initializing Annotator pipeline ...");
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        pipeline = new StanfordCoreNLP(props);
        logger.info("Annotator pipeline initialized");
    }

    List<String> findNameEntityType(String text, String entity) {
        logger.info("Finding entity type matches in the " + text + " for entity type, " + entity);
        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);
        // run all Annotators on this text
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        List<String> matches = new ArrayList<String>();
        for (CoreMap sentence : sentences) {
            int previousCount = 0;
            int count = 0;
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                int previousWordIndex;
                if (entity.equals(token.get(CoreAnnotations.NamedEntityTagAnnotation.class))) {
                    count++;
                    if (previousCount != 0 && (previousCount + 1) == count) {
                        previousWordIndex = matches.size() - 1;
                        String previousWord = matches.get(previousWordIndex);
                        matches.remove(previousWordIndex);
                        previousWord = previousWord.concat(" " + word);
                        matches.add(previousWordIndex, previousWord);
                    } else {
                        matches.add(word);
                    }
                    previousCount = count;
                } else {
                    count = 0;
                    previousCount = 0;
                }
            }
        }
        return matches;
    }
}
Another approach to dealing with multi-word entities.
This code combines multiple tokens if they have the same annotation and appear in a row.
Restriction:
If the same token has two different annotations, the last one will be saved.
private Document getEntities(String fullText) {
    Document entitiesList = new Document();
    NERClassifierCombiner nerCombClassifier = loadNERClassifiers();
    if (nerCombClassifier != null) {
        List<List<CoreLabel>> results = nerCombClassifier.classify(fullText);
        for (List<CoreLabel> coreLabels : results) {
            String prevLabel = null;
            String prevToken = null;
            for (CoreLabel coreLabel : coreLabels) {
                String word = coreLabel.word();
                String annotation = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);
                if (!"O".equals(annotation)) {
                    if (prevLabel == null) {
                        prevLabel = annotation;
                        prevToken = word;
                    } else {
                        if (prevLabel.equals(annotation)) {
                            prevToken += " " + word;
                        } else {
                            prevLabel = annotation;
                            prevToken = word;
                        }
                    }
                } else {
                    if (prevLabel != null) {
                        entitiesList.put(prevToken, prevLabel);
                        prevLabel = null;
                    }
                }
            }
            // flush an entity that runs to the end of the sentence
            if (prevLabel != null) {
                entitiesList.put(prevToken, prevLabel);
            }
        }
    }
    return entitiesList;
}
Imports:
Document: org.bson.Document;
NERClassifierCombiner: edu.stanford.nlp.ie.NERClassifierCombiner;
