Stanford coreNLP - split words ignoring apostrophe - nlp

I'm trying to split a sentence into words using Stanford CoreNLP.
I'm having a problem with words that contain an apostrophe.
For example, the sentence:
I'm 24 years old.
Splits like this:
[I] ['m] [24] [years] [old]
Is it possible to split it like this using Stanford coreNLP?:
[I'm] [24] [years] [old]
I've tried using tokenize.whitespace, but then it doesn't split on other punctuation marks such as '?' and ','.

Currently, no. The subsequent Stanford CoreNLP processing tools all use Penn Treebank tokenization, which splits contractions into two tokens (regarding "I'm" as a reduced form of "I am" by making it the two "words" [I] ['m]). It sounds like you want a different type of tokenization.
While there are some tokenization options, there isn't one to change this, and subsequent tools (like the POS tagger or parser) would work badly without contractions being split. You could add such an option to the tokenizer, changing (deleting) the treatment of REDAUX and SREDAUX trailing contexts.
You can also join contractions via post-processing as @dhg suggests, but you'd want to do it a little more carefully in the "if" so it doesn't join on quotes.

How about if you just re-concatenate tokens that are split by an apostrophe?
Here's an implementation in Java:
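import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class ApostropheTokenizer {
    public static List<String> tokenize(String s) {
        PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(
                new StringReader(s), new CoreLabelTokenFactory(), "");
        List<String> sentence = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        while (ptbt.hasNext()) {
            String word = ptbt.next().word();
            if (word.startsWith("'")) {
                // Glue the token onto the pending word (this also catches quote tokens).
                sb.append(word);
            } else {
                // A new word starts: flush the pending one first.
                if (sb.length() > 0)
                    sentence.add(sb.toString());
                sb = new StringBuilder();
                sb.append(word);
            }
        }
        if (sb.length() > 0)
            sentence.add(sb.toString());
        return sentence;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I'm 24 years old.")); // [I'm, 24, years, old, .]
    }
}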
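A more careful variant of the same idea, as a minimal sketch: it works on an already tokenized list and only rejoins tokens from a fixed set of clitics, so a bare quote token is left alone. The class name ContractionJoiner and the contents of the CLITICS set are assumptions for illustration, not part of CoreNLP; note that 's is ambiguous between a contraction and a possessive, and this sketch joins both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ContractionJoiner {
    // Tokens that Penn Treebank tokenization splits off (assumed list; extend as needed).
    private static final Set<String> CLITICS = new HashSet<>(Arrays.asList(
            "'m", "'s", "'re", "'ve", "'ll", "'d", "n't"));

    /** Glues each clitic back onto the token before it; other tokens pass through unchanged. */
    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            boolean isClitic = CLITICS.contains(token.toLowerCase(Locale.ROOT));
            if (isClitic && !out.isEmpty()) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + token);
            } else {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [I'm, 24, years, old, .]
        System.out.println(join(Arrays.asList("I", "'m", "24", "years", "old", ".")));
        // prints [He, said, ', don't, ', .]  (the quote tokens are not joined)
        System.out.println(join(Arrays.asList("He", "said", "'", "do", "n't", "'", ".")));
    }
}
```

The input lists above are written by hand to mimic PTB-style tokens; in practice they would come from the tokenizer.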

There are possessives and contractions. Your example is a contraction. Just looking for an apostrophe won't tell you the difference between the two. "This is Pete's answer. I'm sure you knew that." In these two sentences we have one of each case.
With the part-of-speech tags we can tell the difference. With the Tsurgeon (tree surgeon) syntax you can assemble those, change them and so forth. The syntax is listed here: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/tsurgeon/package-summary.html. I've found Tsurgeon to be really useful in pulling apart NP groups, as I like to break them up over conjunctions.
Alternatively, does 'm stem to "am"? You might want to look for those tokens, look up their stem tag, and simply revert each one to that value. Stemming is extremely useful in many other aspects of machine learning and analysis.
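To illustrate the part-of-speech idea, here is a small sketch that labels an apostrophe token from its Penn Treebank tag: the possessive ending carries the tag POS, while contracted verbs, modals and n't carry verb (VB*), MD and RB tags. The class ApostropheKinds and its string labels are hypothetical, and the tags in main are written by hand rather than produced by a tagger.

```java
final class ApostropheKinds {
    /** Returns "possessive", "contraction" or "other" for one (word, Penn Treebank tag) pair. */
    static String kind(String word, String tag) {
        boolean clitic = word.startsWith("'") || word.equalsIgnoreCase("n't");
        if (!clitic) {
            return "other";
        }
        if (tag.equals("POS")) {
            return "possessive";            // Pete 's
        }
        if (tag.startsWith("VB") || tag.equals("MD") || tag.equals("RB")) {
            return "contraction";           // I 'm, she 'll, do n't
        }
        return "other";                     // e.g. a closing quote tagged ''
    }

    public static void main(String[] args) {
        // Hand-tagged tokens from "This is Pete's answer. I'm sure you knew that."
        String[][] tagged = {
            {"Pete", "NNP"}, {"'s", "POS"}, {"answer", "NN"},
            {"I", "PRP"}, {"'m", "VBP"}, {"sure", "JJ"}
        };
        for (String[] pair : tagged) {
            System.out.println(pair[0] + " -> " + kind(pair[0], pair[1]));
        }
    }
}
```

With labels like these, a post-processing step could rejoin only the contractions, or treat the two cases differently.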

Related

Code about replacing certain words in discord.js

I was trying to make the bot replace multiple words in one sentence with another word.
ex: User will say "Today is a great day"
and the bot shall answer "Today is a bad night"
the words "great" and "day" were replaced by the words "bad" and "night" in this example.
I've been searching in order to find a similar code, but unfortunately all I could find is "word-blacklisting" scripts.
I tried to do some coding with it, but I am not an expert with node.js and the code is written really badly. It's not even worth showing, really.
The user will say some sentence, and the bot will recognize some predetermined words in the sentence and replace them with other words that I'll decide in the script.
We can use String.replace() combined with Regular Expressions to match and replace single words of your choosing.
Consider this example:
function antonyms(string) {
  return string
    .replace(/(?<![A-Z])terrible(?![A-Z])/gi, 'great')
    .replace(/(?<![A-Z])today(?![A-Z])/gi, 'tonight')
    .replace(/(?<![A-Z])day(?![A-Z])/gi, 'night');
}

const original = 'Today is a tErRiBlE day.';
console.log(original);

const altered = antonyms(original);
console.log(altered);

const testStr = 'Daylight is approaching.'; // Note that this contains 'day' *within* a word.
const testRes = antonyms(testStr);          // The lookarounds in the regex prevent replacement.
console.log(testRes);                       // If this isn't the desired behavior, you can remove them.

how to remove duplicate words in a string in matlab

consider I have this string
a='flexray_datain_flexray_sensors'
and I want to process this string to get
a='flexray_datain_sensors'
And the thing is this can be for any repeated words and not just flexray in matlab. If I already know what the word is then it's easy
I tried:
parts = textscan(bypname , '%s', 'delimiter', '_');
parts = parts{:};
and then processing this cell (parts) using unique or something similar to remove the repeated words. But I need a better answer.
Does this work for you?
strjoin(unique(strsplit(a,'_'),'stable'),'_')

JAPE rule Sentence contains multiple cases

How can I check whether a sentence contains a combination? For example, consider this sentence:
John appointed as new CEO for google.
I need to write a rule to check whether the sentence contains < 'new' + 'Jobtitle' >.
How can I achieve this? I tried the following. I need to check whether there is a 'new' before the job-title word.
Rule: CustomRules
(
{
Sentence contains {Lookup.majorType == "organization"},
Sentence contains {Lookup.majorType == "jobtitle"},
Sentence contains {Lookup.majorType == "person_first"}
}
)
One way to handle this is to reverse the approach. Focus on the sequence you need and then get the covering Sentence:
(
  {Token@string == "new"}
  {Lookup.majorType == "jobtitle"}
):newJT
You should also check the edge case where a new Sentence starts right after "new", like this:
new
CEO
You can use something like this:
{Token ... }
{!Sentence, Lookup.majorType ...}
And then get the sentence (if you really need it) in the java RHS:
long end = newJTAnnots.lastNode().getOffset();
long start = newJTAnnots.firstNode().getOffset();
AnnotationSet sentences = inputAS.get("Sentence", start, end);

Tell if specific char in string is a long char or a short char

Be prepared, this is one of those hard questions.
In Farsi (Persian), the letter ی, which sounds like y or i, is written in 4 different shapes according to its place in the word. I'll call ی YA from now on for simplicity.
Take a look at this image:
All YA characters are painted in red. In the first word, YA is attached to its previous character (the one on its right; in Farsi we write from RIGHT to LEFT) and is free at the end, whereas the last YA (3rd word, left-most red char) is free on both the left and the right.
Having told this long story, I want to find out whether a part of a string ends with long YA (YA without points) or short YA (YA with two points beneath it).
E.g. تحصیلداری (the 3rd word) ends with long YA, but تحصیـ, which is a part of the 3rd word, ends with short YA instead.
Question: How can I tell which Unicode code point تحصیلداری ends with? I just have a simple string, "تحصیلداری"; how can I convert its characters to their Unicode code points?
I tried printing the code points:
string unicodes = "";
foreach (char c in "تحصیلداری")
{
    unicodes += c + " " + ((int)c).ToString() + Environment.NewLine;
}
MessageBox.Show(unicodes);
result :
but at the end of the day unfortunately all YAs have the same unicode.
Bad news : YA was an example, a real one though. There are also a dozen of other characters like YA with different appearances too.
Additional info :
using this useful link about unicodes I found unicode of different YAs
We solved a similar problem in the way below:
We had a core banking application; the customer sub-system needed a full-text search on the customers' name, family name, father's name, etc.
Different encodings, legacy migrated data, keyboard layouts and Farsi fonts made the search process inaccurate.
We overcame the problem by replacing problematic characters with a standard one and saving the standardized string for search purposes.
After several iterations, the replacement is as below, which may come in handy:
Formula="UPPER(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(FirsName || LastName || FatherName,
chr(32),''),
chr(13),''),
chr(9),''),
chr(10),''),
'-',''),
'-',''),
'آ','ا'),
'أ', 'ا'),
'ئ', 'ي'),
'ي', 'ي'),
'ك', 'ک'),
'آإئؤةي','اايوهي'),
'ء',''),
'شأل','شاال'),
'ا.','اله'),
'.',''),
'الله','اله'),
'ؤ','و'),
'إ','ا'),
'ة','ه'),
' ا لله','اله'),
'ا لله','اله'),
' ا لله','اله'))"
Although there are different YEHs in Unicode, it must be noted that all presentation forms of the Farsi YEH are the same Unicode character, with code 0x06CC. You cannot determine the presentation form from the Unicode code.
But you can reach your goal by checking which characters come before or after the YEH.
You can also use Fardis to see Unicode codes of strings.
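The neighbour check can be sketched as follows (in Java here; the same logic carries over to C#). For the Farsi YEH (U+06CC), the initial and medial forms carry two dots and the final and isolated forms do not, so the question reduces to whether the YEH connects to the character after it. The class YehForm and its method names are hypothetical, and the joining test is a deliberate simplification: it treats every Arabic-block letter except hamza (U+0621), plus tatweel and the zero-width joiner, as connectable, and it does not skip combining marks.

```java
final class YehForm {
    private static final char FARSI_YEH = '\u06CC';
    private static final char TATWEEL = '\u0640';
    private static final char ZWJ = '\u200D';
    private static final char HAMZA = '\u0621';

    /** True if c can connect to the letter written before it (simplified joining test). */
    static boolean joinsToPrevious(char c) {
        if (c == TATWEEL || c == ZWJ) {
            return true;
        }
        return c != HAMZA
                && Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC
                && Character.isLetter(c);
    }

    /** True if the YEH at index i is drawn with two dots (initial or medial form). */
    static boolean isDottedYeh(String s, int i) {
        if (s.charAt(i) != FARSI_YEH) {
            throw new IllegalArgumentException("no Farsi YEH at index " + i);
        }
        return i + 1 < s.length() && joinsToPrevious(s.charAt(i + 1));
    }

    public static void main(String[] args) {
        String word = "\u062A\u062D\u0635\u06CC\u0644\u062F\u0627\u0631\u06CC"; // تحصیلداری
        String part = "\u062A\u062D\u0635\u06CC\u0640";                         // تحصیـ
        System.out.println(isDottedYeh(word, 8)); // false: word-final YEH, no dots
        System.out.println(isDottedYeh(word, 3)); // true: followed by LAM
        System.out.println(isDottedYeh(part, 3)); // true: followed by tatweel
    }
}
```

Whether the YEH also connects on its other side (final versus isolated form) would need the same kind of check on the preceding character, with the letters that never connect forward (such as ا, د, ر, و) excluded.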

How to reformat paragraph to have each sentence on a separate line?

Input:
Hi. I am John.
My name is John. Who are you ?
Output:
Hi
I am John
My name is John
Who are you
String line = "Hi. My name is John. Who are you ?";
String[] sentences = line.split("(?<=[.!?])\\s+");
for (String sentence : sentences) {
    System.out.println("[" + sentence + "]");
}
This produces:
[Hi.]
[My name is John.]
[Who are you ?]
See also
regular-expressions.info tutorials
Lookarounds
Character classes
Java language guide: the for-each loop
If you're not comfortable using split (even though it's the recommended replacement for the "legacy" java.util.StringTokenizer), you can use just java.util.Scanner (which is more than adequate to do the job).
See also
Scanner vs. StringTokenizer vs. String.Split
Here's a solution that uses Scanner, which by the way implements Iterator<String>. For extra instructional value, I'm also showing an example of using java.lang.Iterable<T> so that you can use the for-each construct.
final String text =
    "Hi. I am John.\n" +
    "My name is John. Who are you ?";
Iterable<String> sentences = new Iterable<String>() {
    @Override public Iterator<String> iterator() {
        return new Scanner(text).useDelimiter("\\s*[.!?]\\s*");
    }
};
for (String sentence : sentences) {
    System.out.println("[" + sentence + "]");
}
This prints:
[Hi]
[I am John]
[My name is John]
[Who are you]
If this regex is still not what you want, then I recommend investing the time to educate yourself so you can take matters into your own hands.
See also
What is the Iterable interface used for?
Why is Java’s Iterator not an Iterable?
Note: the final modifier for the local variable text in the above snippet is a necessity. In an illustrative example it makes for concise code, but in your actual code you should refactor the anonymous class into its own named class and have it take text in the constructor.
See also
Anonymous vs named inner classes? - best practices?
Cannot refer to a non-final variable inside an inner class defined in a different method
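The refactoring suggested in the note above could look roughly like this. The class name Sentences is made up for illustration; the delimiter regex is the one from the Scanner snippet.

```java
import java.util.Iterator;
import java.util.Scanner;

final class Sentences implements Iterable<String> {
    private final String text;

    Sentences(String text) {
        this.text = text;
    }

    @Override
    public Iterator<String> iterator() {
        // A fresh Scanner per call, so the same object can be iterated more than once.
        return new Scanner(text).useDelimiter("\\s*[.!?]\\s*");
    }

    public static void main(String[] args) {
        String text = "Hi. I am John.\nMy name is John. Who are you ?";
        for (String sentence : new Sentences(text)) {
            System.out.println(sentence); // one sentence per line, punctuation stripped
        }
    }
}
```

Because the text is now a constructor argument stored in a field, no final local variable is needed at the call site.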

Resources