String comparison not working for sharepoint multiline text values - sharepoint

I am fetching data from a SharePoint list for a multi-line column.
I then split the data by spaces and compare each word to another string, but even though both strings appear to hold the same value, the comparison returns false.
Please see the code below:
string[] strBodys = SPHttpUtility.ConvertSimpleHtmlToText(Convert.ToString(workflowProperties.ListItem[SCMSConstants.lstfldBody]), Convert.ToString(workflowProperties.ListItem[SCMSConstants.lstfldBody]).Length).Split(' ');
bool hasKwrdInBody = false;
foreach (SPItem oItem in oColl)
{//get all the keywords
string[] strkeyWrds = SPHttpUtility.ConvertSimpleHtmlToText(Convert.ToString(oItem[SCMSConstants.lstfldKWConfigKeywordsIntrName]), Convert.ToString(oItem[SCMSConstants.lstfldKWConfigKeywordsIntrName]).Length).Split(',');
//in body
foreach (string strKW in strkeyWrds)
{
string KWValue = strKW.Trim(' ').ToLower();
foreach (string strBdy in strBodys)
{
string BodyValue = strBdy.Trim(' ').ToLower();
//if (strKW.ToLower().Equals(strBdy.ToLower()))
if(KWValue == BodyValue) //here it always gives false result
{
hasKwrdInBody = true;
break;
}
}
if (hasKwrdInBody)
break;
}
if (!hasKwrdInSbjct && !hasKwrdInBody)
{
continue;
}
else
{
//set business unit to current groups rule
bsnsUnitLookupFld = new SPFieldLookupValue(Convert.ToString(oItem[SCMSConstants.lstfldBsnsUnit]));
asgndTo = new SPFieldUserValue(objWeb,Convert.ToString(oItem[SCMSConstants.lstfldKWConfigAssignedToIntrName])).User;
groupName = Convert.ToString(oItem[SCMSConstants.lstfldKWConfigAssignedToGroupIntrName]).Split('#').Last();
break;
}
}
Please note that I am reading multi-line text from a SharePoint list.
Any suggestions would be appreciated.

That also depends on the exact type of your multiline field (e.g. Plain Text or RichText).
It might become clearer if you added some logging that writes out the values you are comparing.
For details on how to get the value of a multiline text field, see "Accessing Multiple line of text programmatically",
and the corresponding guidance for RichText fields.

I got it working after comparing the two strings and counting their characters. It turned out that some hidden non-ASCII (Unicode) characters were embedded in the string. I first removed those characters using a regular expression and then compared the strings, and it worked like a charm.
Here is the code snippet; it might help someone.
string[] strBodys = SPHttpUtility.ConvertSimpleHtmlToText(Convert.ToString(workflowProperties.ListItem[SCMSConstants.lstfldBody]), Convert.ToString(workflowProperties.ListItem[SCMSConstants.lstfldBody]).Length).Split(' ');
bool hasKwrdInBody = false;
foreach (SPItem oItem in oColl)
{//get all the keywords
string[] strkeyWrds = SPHttpUtility.ConvertSimpleHtmlToText(Convert.ToString(oItem[SCMSConstants.lstfldKWConfigKeywordsIntrName]), Convert.ToString(oItem[SCMSConstants.lstfldKWConfigKeywordsIntrName]).Length).Split(',');
//in body
foreach (string strKW in strkeyWrds)
{
string KWValue = strKW.Trim(' ').ToLower();
KWValue = Regex.Replace(KWValue, @"[^\u0000-\u007F]", string.Empty); // strip non-ASCII characters
foreach (string strBdy in strBodys)
{
string BodyValue = strBdy.Trim(' ').ToLower();
BodyValue = Regex.Replace(BodyValue, @"\t|\n|\r", string.Empty); // new code: strip tabs and line breaks
BodyValue = Regex.Replace(BodyValue, @"[^\u0000-\u007F]", string.Empty); // new code: strip non-ASCII characters
//if (strKW.ToLower().Equals(strBdy.ToLower()))
if(KWValue == BodyValue) // now matches once the hidden characters are removed
{
hasKwrdInBody = true;
break;
}
}
if (hasKwrdInBody)
break;
}
if (!hasKwrdInSbjct && !hasKwrdInBody)
{
continue;
}
else
{
//set business unit to current groups rule
bsnsUnitLookupFld = new SPFieldLookupValue(Convert.ToString(oItem[SCMSConstants.lstfldBsnsUnit]));
asgndTo = new SPFieldUserValue(objWeb,Convert.ToString(oItem[SCMSConstants.lstfldKWConfigAssignedToIntrName])).User;
groupName = Convert.ToString(oItem[SCMSConstants.lstfldKWConfigAssignedToGroupIntrName]).Split('#').Last();
break;
}
}
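To see why two visually identical strings compare as unequal, it can help to dump the code points of each one before stripping anything. The sketch below is written in Java rather than C#, purely for illustration; the same regular-expression idea carries over. The names `HiddenCharDebug`, `describe` and `normalize` are made up for this example, and the zero-width space mentioned afterwards is only an assumed culprit, not necessarily what SharePoint embedded in the original data.

```java
import java.util.Locale;
import java.util.stream.Collectors;

class HiddenCharDebug {
    // Renders every code point as U+XXXX so invisible characters become visible in a log.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // Mirrors the clean-up above: lower-case, drop tabs and line breaks,
    // drop everything outside the ASCII range, then trim surrounding spaces.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("[\\t\\n\\r]", "")
                .replaceAll("[^\\x00-\\x7F]", "")
                .trim();
    }
}
```

Logging `describe(...)` for both the keyword and the body word shows exactly which extra characters are present, for example a trailing zero-width space (U+200B), and `normalize(...)` can then be applied to both sides before comparing.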

Related

WordProcessingDocument not preserving whitespace

I'm writing a C# program using XML and Linq that reads in data from tables stored in a word document and inserts it into an excel spreadsheet. The code I have so far does this, however it does not preserve any new lines (in the word doc the "new line" is done by pressing the enter key). Using the debugger, I can see that the new lines aren't even being read in. For example, if the text I want to copy is:
Something like this
And another line
And maybe even a third line
It gets read in as:
Something like thisAnd another lineAnd maybe even a third line
I can't separate the lines by a character as the words could be anything. This is what I have so far:
internal override Dictionary<string, string> GetContent()
{
Dictionary<string, string> contents = new Dictionary<string, string>();
using (WordprocessingDocument doc = WordprocessingDocument.Open(MainForm.WordFileDialog.FileName, false))
{
List<Table> tables = doc.MainDocumentPart.Document.Descendants<Table>().ToList();
foreach (Table table in tables)
{
TableRow headerRow = table.Elements<TableRow>().ElementAt(0);
TableCell tableSectionTitle;
try
{
tableSectionTitle = headerRow.Elements<TableCell>().ElementAt(0);
}
catch (ArgumentOutOfRangeException)
{
continue;
}
List<TableRow> rows = table.Descendants<TableRow>().ToList();
foreach (TableRow row in rows)
{
TableCell headerCell = row.Elements<TableCell>().ElementAt(0);
if (headerCell.InnerText.ToLower().Contains("first item"))
{
contents.Add("first item", row.Elements<TableCell>().ElementAt(1).InnerText);
}
else if (headerCell.InnerText.ToLower().Contains("second item:"))
{
char[] split = { ':' };
Int32 count = 2;
string str = row.Elements<TableCell>().ElementAt(0).InnerText;
String[] newStr = str.Split(split, count, StringSplitOptions.None);
contents.Add("second item:", newStr[1]);
}
**continues for many more else if statements**
else
{
continue;
}
}
}
return contents;
}
}
I'm new to using XML, so any help would be appreciated!

Export Rich Text to plain text c#

Good day to Stackoverflow community,
I am in need of some expert assistance. I have an MVC4 web app that has a few rich text box fields powered by TinyMCE. Up until now the system is working great. Last week my client informed me that they want to export the data stored in Microsoft SQL to Excel to run custom reports.
I am able to export the data to excel with the code supplied. However it is exporting the data in RTF rather than Plain text. This is causing issues when they try to read the content.
Due to lack of knowledge and or understanding I am unable to figure this out. I did read that it is possible to use regex to do this however I have no idea how to implement this. So I turn to you for assistance.
public ActionResult ExportReferralData()
{
GridView gv = new GridView();
gv.DataSource = db.Referrals.ToList();
gv.DataBind();
Response.ClearContent();
Response.Buffer = true;
Response.AddHeader("content-disposition", "attachment; filename=UnderwritingReferrals.xls");
Response.ContentType = "application/ms-excel";
Response.AddHeader("Content-Type", "application/vnd.ms-excel");
Response.Charset = "";
Response.Cache.SetCacheability(HttpCacheability.NoCache);
StringWriter sw = new StringWriter();
HtmlTextWriter htw = new HtmlTextWriter(sw);
gv.RenderControl(htw);
Response.Output.Write(sw.ToString());
Response.Flush();
Response.End();
return RedirectToAction("Index");
}
I would really appreciate any assistance. and thank you in advance.
I have looked for solutions on YouTube and web forums without any success.
Kind Regards
Francois Muller
One option is to massage the data before you write it to the file.
For example, identify a given HTML tag in your string and replace it with string.Empty.
The matching closing tag can similarly be replaced with string.Empty.
Once you have identified all the variants of the rich text HTML tags, you can create a list of the tags and, inside a for loop, replace each of them with a suitable string.
Did you try saving the file as .xlsx and sending it over to the client?
The newer Excel format might handle the data more gracefully.
Add this function to your code, and then you can invoke the function passing it in the HTML string. The return output will be HTML free.
Warning: This does not work for all cases and should not be used to process untrusted user input. Please test it with variants of your input string.
public static string StripTagsCharArray(string source)
{
char[] array = new char[source.Length];
int arrayIndex = 0;
bool inside = false;
for (int i = 0; i < source.Length; i++)
{
char let = source[i];
if (let == '<')
{ inside = true; continue; }
if (let == '>') { inside = false; continue; }
if (!inside) { array[arrayIndex] = let; arrayIndex++; }
}
return new string(array, 0, arrayIndex);
}
I managed to resolve this issue by changing the original code as follows.
As I'm only trying to convert a few columns, I found this to work well. It ensures each record is placed on its own row in Excel and converts the HTML to plain text, allowing users to add column filters in Excel.
I hope this helps anyone else who has a similar issue.
GridView gv = new GridView();
var From = RExportFrom;
var To = RExportTo;
if (RExportFrom == null || RExportTo == null)
{
/* The actual code to be used */
gv.DataSource = db.Referrals.OrderBy(m =>m.Date_Logged).ToList();
}
else
{
gv.DataSource = db.Referrals.Where(m => m.Date_Logged >= From && m.Date_Logged <= To).OrderBy(m => m.Date_Logged).ToList();
}
gv.DataBind();
foreach (GridViewRow row in gv.Rows)
{
if (row.Cells[20].Text.Contains("<"))
{
row.Cells[20].Text = Regex.Replace(row.Cells[20].Text, "<(?<tag>.+?)(>|>)", " ");
}
if (row.Cells[21].Text.Contains("<"))
{
row.Cells[21].Text = Regex.Replace(row.Cells[21].Text, "<(?<tag>.+?)(>|>)", " ");
}
if (row.Cells[22].Text.Contains("<"))
{
row.Cells[22].Text = Regex.Replace(row.Cells[22].Text, "<(?<tag>.+?)(>|>)", " ");
}
if (row.Cells[37].Text.Contains("<"))
{
row.Cells[37].Text = Regex.Replace(row.Cells[37].Text, "<(?<tag>.+?)(>|>)", " ");
}
if (row.Cells[50].Text.Contains("<"))
{
row.Cells[50].Text = Regex.Replace(row.Cells[50].Text, "<(?<tag>.+?)(>|>)", " ");
}
}
Response.ClearContent();
Response.Buffer = true;
Response.AddHeader("content-disposition", "attachment; filename=Referrals " + DateTime.Now.ToString("dd/MM/yyyy") + ".xls");
Response.ContentType = "application/ms-excel";
Response.ContentEncoding = System.Text.Encoding.UTF8;
Response.AddHeader("Content-Type", "application/vnd.ms-excel");
Response.Charset = "";
Response.Cache.SetCacheability(HttpCacheability.NoCache);
StringWriter sw = new StringWriter();
HtmlTextWriter htw = new HtmlTextWriter(sw);
gv.RenderControl(htw);
//This code will export the data to Excel and remove all HTML Tags to pass everything into Plain text.
//I am using HttpUtility.HtmlDecode twice as the first instance changes null values to "Â" the second time it will run the replace code.
//I am using string.Replace to change the headings to more understandable headings rather than the headings produced by the Model.
Response.Write(HttpUtility.HtmlDecode(sw.ToString())
.Replace("Cover_Details", "Referral Detail")
.Replace("Id", "Identity Number")
.Replace("Unique_Ref", "Reference Number")
.Replace("Date_Logged", "Date Logged")
.Replace("Logged_By", "File Number")
.Replace("Date_Referral", "Date of Referral")
.Replace("Referred_By", "Name of Referrer")
.Replace("UWRules", "Underwriting Rules")
.Replace("Referred_To", "Name of Referrer")
);
Response.Flush();
Response.End();
TempData["success"] = "Data successfully exported!";
return RedirectToAction("Index");
}

How to pass multiple list types as a parameter using the same method variable

I'm trying to pass multiple list types as a parameter using the same method variable and then loop through the items based on which type has been passed. I tried using a generic method but it's not working. Below is pseudo/example code. List<SAS_F_DISAGG_F> and List<SAS_C_DISAGG_C> are SQL/Entity types, and List<DisaggReportGroups> is a class object. I'm trying to pass the entity lists.
protected void GetReportGroup()
{
DisaggReportGroups rptGroup = new DisaggReportGroups();
List<DisaggReportGroups> disagreportGroup = new List<DisaggReportGroups>();
disagreportGroup.Add(rptGroup);
DisaggregatedReportData disagReportData = new DisaggregatedReportData();
foreach (var reportGroup in disagreportGroup)
{
if (reportGroup.FuturesOnly == "Futures Only, " && reportGroup.Agriculture == "Agriculture")
{
List<SAS_F_DISAGG_F> futONlyDisagReportData = disagReportData.GetFuturesOnlyReportData(reportGroup.Agriculture).ToList();
CreateLongFormatReport<List<SAS_F_DISAGG_F>>(reportGroup.AgricultureFilenameFOLF, reportGroup.FuturesOnly, reportGroup.Agriculture, futONlyDisagReportData);
}
else if (reportGroup.FOCombined == "Futures and Options Combined, " && reportGroup.Agriculture == "Agriculture")
{
List<SAS_C_DISAGG_C> combinedDisagReportData = disagReportData.GetFOCombinedReportData(reportGroup.Agriculture).ToList();
CreateLongFormatReport<List<SAS_C_DISAGG_C>>(reportGroup.AgricultureFilenameFOCombinedLF, reportGroup.FOCombined, reportGroup.Agriculture, combinedDisagReportData);
}
}
}
protected void CreateFormatReport<T>(string filename, string disagCategory, string commSubGp, List<T> reportData)
{
using (FileStream fileStream = new FileStream(Server.MapPath(@"~/Includes/") + filename, FileMode.Create))
{
using (StreamWriter writer = new StreamWriter(fileStream))
{
foreach (var value in reportData)
{
string FuturesOnly = "Futures Only, ";
string FOCombined = "Futures and Options Combined, ";
string reportCategory = "";
if (disagCategory == FuturesOnly)
{
reportCategory = FuturesOnly;
}
else if (disagCategory == FOCombined)
{
reportCategory = FOCombined;
}
string row01 = String.Format("{0, -10}{1, 29}{2, 8}", value.MKTTITL.PadRight(120), "Code -", value.Conmkt);
string row02 = String.Format("{0, -10}{1, 7}{2, 14}", "Blah Blah - ", reportCategory, value.DAT1TITL);
string row03 = String.Format("{0, 3}{1, 3}{2, 8:0,0}{3, 3}{4, 8:0,0}{5, 11:0,0}{6, 11:0,0}{7, 11:0,0}{8, 11:0,0}{9, 13:0,0}{10, 11:0,0}{11, 11:0,0}{12, 13:0,0}{13, 10:0,0}{14, 9:0,0}{15, 3}{16, 8:0,0}{17, 10:0,0}", "All",
colon, value.TA01, colon, value.TA02, value.TA03, value.TA04, value.TA05, value.TA06, value.TA07, value.TA08, value.TA09, value.TA10, value.TA11, value.TA12, colon, value.TA15, value.TA16);
string row04 = String.Format("{0, 3}{1, 3}{2, 8:0,0}{3, 3}{4, 8:0,0}{5, 11:0,0}{6, 11:0,0}{7, 11:0,0}{8, 11:0.##}{9, 13:0,0}{10, 11:0,0}{11, 11:0,0}{12, 13:0,0}{13, 10:0,0}{14, 9:0,0}{15, 3}{16, 8:0,0}{17, 10:0,0}", "Old",
colon, value.TO01, colon, value.TO02, value.TO03, value.TO04, value.TO05, value.TO06, value.TO07, value.TO08, value.TO09, value.TO10, value.TO11, value.TO12, colon, value.TO15, value.TO16);
writer.Write(row01);
writer.WriteLine(row02);
writer.WriteLine(row03);
writer.WriteLine(row04);
} //end foreach
writer.Close();
} //end of stream writer
}
}
Thanks for your help.
I managed to solve this problem myself so I'm posting my solution for others that may need the same type of help. The solution is to use Reflection within the foreach iteration.
foreach (var value in reportData)
{
//Reflection can be used
string TA01 = value.GetType().GetProperty("TA01").GetValue(value).ToString();
//...
//...
//do more stuff/coding...
}
Then, in the String.Format calls, change "value.TA01" to "TA01". Do the same for all the other variables.
Hope this helps.

How to handle textbox values on server side

I have three textboxes, and a number is entered in each of them, for example:
Textbox1-0123456789
Textbox2-0123-456-789
Textbox3-0123-456-789
Now on the server side, i.e. in the aspx.cs page, I need to check whether the numbers are the same, and only one distinct number should be saved in the database.
//Get the values from text box and form a list
//validate against a regular expression to make them pure numeric
//now check if they are all same
List<string> lst = new List<string>()
{
"0123-456-A789",
"0123-456-A789",
"0123-456-789"
};
Regex rgx = new Regex("[^a-zA-Z0-9]");
//s1 = rgx.Replace(s1, "");
for (int i = 0; i < lst.Count; i++)
{
var value = lst[i];
value = rgx.Replace(value, "");
lst[i] = value;
}
if (lst.Any(num => num != lst[0]))
{
Console.WriteLine("All are not same");
}
else
{
Console.WriteLine("All are same");
}
//if all are same, pick an entry from the list
// if not throw error
Hope this gives you an idea.
If we apply Replace("-", ""), it will remove the dashes from every textbox. Suppose the numbers are:
textbox1-0123456789
textbox2=0123-456-789
textbox3=678-908-999
Here textbox1 and textbox2 hold the same number, but Replace will also remove the dashes from textbox3, which we don't want.
So for this we have to apply a "not exists" check with a lambda on the list.
List<string> strMobileNos = new List<string>();
Regex re = new Regex(@"\d{10}|\d{3}\s*-\s*\d{3}\s*-\s*\d{4}");
!strMobileNos.Exists(l => l.Replace("-", "") == Request.Form["txtMobNo2"].Replace("Mobile2", "").Replace("-", ""))

Multi-term named entities in Stanford Named Entity Recognizer

I'm using the Stanford Named Entity Recognizer http://nlp.stanford.edu/software/CRF-NER.shtml and it's working fine. This is my code:
List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
for (CoreLabel word : sentence) {
if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
namedEntities.add(word.word().trim());
}
}
}
However the problem I'm finding is identifying names and surnames. If the recognizer encounters "Joe Smith", it is returning "Joe" and "Smith" separately. I'd really like it to return "Joe Smith" as one term.
Could this be achieved through the recognizer maybe through a configuration? I didn't find anything in the javadoc till now.
Thanks!
This is because your inner for loop is iterating over individual tokens (words) and adding them separately. You need to change things to add whole names at once.
One way is to replace the inner for loop with a regular for loop with a while loop inside it which takes adjacent non-O things of the same class and adds them as a single entity.*
Another way would be to use the CRFClassifier method call:
List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)
which will give you whole entities, which you can extract the String form of by using substring on the original input.
*The models that we distribute use a simple raw IO label scheme, where things are labeled PERSON or LOCATION, and the appropriate thing to do is simply to coalesce adjacent tokens with the same label. Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012).
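The coalescing of adjacent tokens described above can be sketched without any Stanford classes. This is a minimal illustration, not the library's own implementation: the class name `EntityCoalescer` and the representation of a token as a `{word, label}` string pair are assumptions made for the example, standing in for `CoreLabel` and its `AnswerAnnotation`.

```java
import java.util.ArrayList;
import java.util.List;

class EntityCoalescer {
    // Each token is a {word, label} pair; "O" is the background (non-entity) label.
    static List<String> coalesce(List<String[]> tokens) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentLabel = "O";
        for (String[] token : tokens) {
            String word = token[0];
            String label = token[1];
            if (label.equals(currentLabel) && !label.equals("O")) {
                // Same entity continues: append the word to the running entity.
                current.append(' ').append(word);
                continue;
            }
            // Label changed: flush the entity collected so far, if there is one.
            if (!currentLabel.equals("O")) {
                entities.add(current + "/" + currentLabel);
            }
            current.setLength(0);
            if (!label.equals("O")) {
                current.append(word);
            }
            currentLabel = label;
        }
        // Flush an entity that runs up to the end of the sentence.
        if (!currentLabel.equals("O")) {
            entities.add(current + "/" + currentLabel);
        }
        return entities;
    }
}
```

With the tokens Joe/PERSON, Smith/PERSON, lives/O, in/O, New/LOCATION, York/LOCATION this yields the two entries "Joe Smith/PERSON" and "New York/LOCATION". Note the final flush after the loop: without it, an entity that ends the sentence would be lost.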
One downside of the classifyToCharacterOffsets method is that (AFAIK) you can't access the label of the entities.
As proposed by Christopher, here is an example of a loop which assembles "adjacent non-O things". This example also counts the number of occurrences.
public HashMap<String, HashMap<String, Integer>> extractEntities(String text){
HashMap<String, HashMap<String, Integer>> entities =
new HashMap<String, HashMap<String, Integer>>();
for (List<CoreLabel> lcl : classifier.classify(text)) {
Iterator<CoreLabel> iterator = lcl.iterator();
if (!iterator.hasNext())
continue;
CoreLabel cl = iterator.next();
while (iterator.hasNext()) {
String answer =
cl.getString(CoreAnnotations.AnswerAnnotation.class);
if (answer.equals("O")) {
cl = iterator.next();
continue;
}
if (!entities.containsKey(answer))
entities.put(answer, new HashMap<String, Integer>());
String value = cl.getString(CoreAnnotations.ValueAnnotation.class);
while (iterator.hasNext()) {
cl = iterator.next();
if (answer.equals(
cl.getString(CoreAnnotations.AnswerAnnotation.class)))
value = value + " " +
cl.getString(CoreAnnotations.ValueAnnotation.class);
else {
if (!entities.get(answer).containsKey(value))
entities.get(answer).put(value, 0);
entities.get(answer).put(value,
entities.get(answer).get(value) + 1);
break;
}
}
if (!iterator.hasNext())
break;
}
}
return entities;
}
I had the same problem, so I looked it up, too. The method proposed by Christopher Manning is efficient, but the delicate point is to know how to decide which kind of separator is appropriate. One could say only a space should be allowed, e.g. "John Zorn" >> one entity. However, I may find the form "J.Zorn", so I should also allow certain punctuation marks. But what about "Jack, James and Joe" ? I might get 2 entities instead of 3 ("Jack James" and "Joe").
By digging a bit into the Stanford NER classes, I actually found a proper implementation of this idea. They use it to export entities in the form of single String objects. For instance, in the method PlainTextDocumentReaderAndWriter.printAnswersInlineXML, we have:
private void printAnswersInlineXML(List<IN> doc, PrintWriter out) {
final String background = flags.backgroundSymbol;
String prevTag = background;
for (Iterator<IN> wordIter = doc.iterator(); wordIter.hasNext();) {
IN wi = wordIter.next();
String tag = StringUtils.getNotNullString(wi.get(AnswerAnnotation.class));
String before = StringUtils.getNotNullString(wi.get(BeforeAnnotation.class));
String current = StringUtils.getNotNullString(wi.get(CoreAnnotations.OriginalTextAnnotation.class));
if (!tag.equals(prevTag)) {
if (!prevTag.equals(background) && !tag.equals(background)) {
out.print("</");
out.print(prevTag);
out.print('>');
out.print(before);
out.print('<');
out.print(tag);
out.print('>');
} else if (!prevTag.equals(background)) {
out.print("</");
out.print(prevTag);
out.print('>');
out.print(before);
} else if (!tag.equals(background)) {
out.print(before);
out.print('<');
out.print(tag);
out.print('>');
}
} else {
out.print(before);
}
out.print(current);
String afterWS = StringUtils.getNotNullString(wi.get(AfterAnnotation.class));
if (!tag.equals(background) && !wordIter.hasNext()) {
out.print("</");
out.print(tag);
out.print('>');
prevTag = background;
} else {
prevTag = tag;
}
out.print(afterWS);
}
}
They iterate over each word, checking whether it has the same class (answer) as the previous one, as explained before. For this, they take advantage of the fact that tokens not considered to be entities are flagged with the so-called backgroundSymbol (class "O"). They also use the BeforeAnnotation property, which holds the string separating the current word from the previous one. This last point solves the problem I initially raised regarding the choice of an appropriate separator.
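The separator idea can be shown in isolation. The following is a rough sketch, not Stanford code: `SeparatorAwareMerger` and its `Tok` holder are hypothetical names, with `before` playing the role of `BeforeAnnotation`, and the split of "J.Zorn" into the tokens "J." and "Zorn" is assumed for the sake of the example.

```java
import java.util.ArrayList;
import java.util.List;

class SeparatorAwareMerger {
    // Hypothetical token holder: the text, its label, and the string that preceded it in the input.
    static final class Tok {
        final String text;
        final String label;
        final String before;

        Tok(String text, String label, String before) {
            this.text = text;
            this.label = label;
            this.before = before;
        }
    }

    // Merges consecutive tokens with the same non-background label,
    // re-inserting the original separator instead of a hard-coded space.
    static List<String> merge(List<Tok> tokens, String background) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = null;
        String prevTag = background;
        for (Tok t : tokens) {
            if (t.label.equals(background)) {
                // A background token ends any entity in progress.
                if (current != null) {
                    entities.add(current.toString());
                    current = null;
                }
            } else if (t.label.equals(prevTag) && current != null) {
                // Same label as the previous token: extend the entity with the original separator.
                current.append(t.before).append(t.text);
            } else {
                // A different entity label starts here: flush the old entity and begin a new one.
                if (current != null) {
                    entities.add(current.toString());
                }
                current = new StringBuilder(t.text);
            }
            prevTag = t.label;
        }
        if (current != null) {
            entities.add(current.toString());
        }
        return entities;
    }
}
```

Because the separator comes from the input itself, "J." followed directly by "Zorn" is rebuilt as "J.Zorn", while "John" and "Zorn" separated by a space are rebuilt as "John Zorn"; no rule about which punctuation to allow is needed.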
Code for the above:
List<Triple<String, Integer, Integer>> result = classifier.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> triple : result)
{
System.out.println(triple.first + " : " + text.substring(triple.second, triple.third));
}
List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
String s = "";
String prevLabel = null;
for (CoreLabel word : sentence) {
if(prevLabel == null || prevLabel.equals(word.get(CoreAnnotations.AnswerAnnotation.class)) ) {
s = s + " " + word;
prevLabel = word.get(CoreAnnotations.AnswerAnnotation.class);
}
else {
if(!prevLabel.equals("O"))
System.out.println(s.trim() + '/' + prevLabel + ' ');
s = " " + word;
prevLabel = word.get(CoreAnnotations.AnswerAnnotation.class);
}
}
if(!prevLabel.equals("O"))
System.out.println(s + '/' + prevLabel + ' ');
}
I just wrote a small piece of logic and it's working fine. What I did is group words with the same label if they are adjacent.
Make use of the classifiers already provided to you. I believe this is what you are looking for:
private static String combineNERSequence(String text) {
String serializedClassifier = "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz";
AbstractSequenceClassifier<CoreLabel> classifier = null;
try {
classifier = CRFClassifier
.getClassifier(serializedClassifier);
} catch (ClassCastException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println(classifier.classifyWithInlineXML(text));
// FOR TSV FORMAT //
//System.out.print(classifier.classifyToString(text, "tsv", false));
return classifier.classifyWithInlineXML(text);
}
Here is my full code. I use Stanford CoreNLP and wrote an algorithm to concatenate multi-term names.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import org.apache.log4j.Logger;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
/**
* Created by Chanuka on 8/28/14 AD.
*/
public class FindNameEntityTypeExecutor {
private static Logger logger = Logger.getLogger(FindNameEntityTypeExecutor.class);
private StanfordCoreNLP pipeline;
public FindNameEntityTypeExecutor() {
logger.info("Initializing Annotator pipeline ...");
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
pipeline = new StanfordCoreNLP(props);
logger.info("Annotator pipeline initialized");
}
List<String> findNameEntityType(String text, String entity) {
logger.info("Finding entity type matches in the " + text + " for entity type, " + entity);
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
List<String> matches = new ArrayList<String>();
for (CoreMap sentence : sentences) {
int previousCount = 0;
int count = 0;
// traversing the words in the current sentence
// a CoreLabel is a CoreMap with additional token-specific methods
for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
String word = token.get(CoreAnnotations.TextAnnotation.class);
int previousWordIndex;
if (entity.equals(token.get(CoreAnnotations.NamedEntityTagAnnotation.class))) {
count++;
if (previousCount != 0 && (previousCount + 1) == count) {
previousWordIndex = matches.size() - 1;
String previousWord = matches.get(previousWordIndex);
matches.remove(previousWordIndex);
previousWord = previousWord.concat(" " + word);
matches.add(previousWordIndex, previousWord);
} else {
matches.add(word);
}
previousCount = count;
}
else
{
count=0;
previousCount=0;
}
}
}
return matches;
}
}
Another approach to dealing with multi-word entities.
This code combines multiple tokens together if they have the same annotation and appear consecutively.
Restriction:
If the same token has two different annotations, the last one will be saved.
private Document getEntities(String fullText) {
Document entitiesList = new Document();
NERClassifierCombiner nerCombClassifier = loadNERClassifiers();
if (nerCombClassifier != null) {
List<List<CoreLabel>> results = nerCombClassifier.classify(fullText);
for (List<CoreLabel> coreLabels : results) {
String prevLabel = null;
String prevToken = null;
for (CoreLabel coreLabel : coreLabels) {
String word = coreLabel.word();
String annotation = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);
if (!"O".equals(annotation)) {
if (prevLabel == null) {
prevLabel = annotation;
prevToken = word;
} else {
if (prevLabel.equals(annotation)) {
prevToken += " " + word;
} else {
prevLabel = annotation;
prevToken = word;
}
}
} else {
if (prevLabel != null) {
entitiesList.put(prevToken, prevLabel);
prevLabel = null;
}
}
}
}
}
return entitiesList;
}
Imports:
Document: org.bson.Document;
NERClassifierCombiner: edu.stanford.nlp.ie.NERClassifierCombiner;
