What is the best way of splitting Japanese text using Java?
For example, given the text below:
こんにちは。私の名前はオバマです。私はアメリカに行く。
I need the following output:
こんにちは
私の名前はオバマです
私はアメリカに行く
Is it possible using Kuromoji?
You can use java.text.BreakIterator.
import java.text.BreakIterator;
import java.util.Locale;

String TEXT = "こんにちは。私の名前はオバマです。私はアメリカに行く。";
BreakIterator boundary = BreakIterator.getSentenceInstance(Locale.JAPAN);
boundary.setText(TEXT);
int start = boundary.first();
// Walk the sentence boundaries and print each sentence on its own line.
for (int end = boundary.next();
        end != BreakIterator.DONE;
        start = end, end = boundary.next()) {
    System.out.println(TEXT.substring(start, end));
}
The output of this program is:
こんにちは。
私の名前はオバマです。
私はアメリカに行く。
You cannot use Kuromoji to find Japanese sentence boundaries. It is a morphological analyzer: it splits a sentence into words, not a text into sentences.
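The BreakIterator output keeps the trailing 。 on each sentence, whereas the question asks for the sentences without it. Here is a minimal sketch of one way to drop it, assuming Java 11+ (for String.strip) and that only the punctuation marks listed in the pattern should be removed. The class name SentenceSplitter and its split method are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSplitter {
    // Splits text at sentence boundaries and removes trailing
    // sentence-ending punctuation from each sentence.
    static List<String> split(String text) {
        BreakIterator boundary = BreakIterator.getSentenceInstance(Locale.JAPAN);
        boundary.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next();
                end != BreakIterator.DONE;
                start = end, end = boundary.next()) {
            String sentence = text.substring(start, end)
                    .strip()                              // surrounding whitespace
                    .replaceAll("[。．.！？!?]+$", "");    // trailing punctuation (assumed set)
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        split("こんにちは。私の名前はオバマです。私はアメリカに行く。")
                .forEach(System.out::println);
    }
}
```

Collecting the sentences into a list instead of printing them directly makes the result easier to reuse. Adjust the punctuation set if your text uses other terminators.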
I'm attempting to do the CS50 course in Dart, and I'm stuck on the week 2 substitution problem:
import 'dart:convert';
import 'dart:io';

void main(List<String> args) {
  String alphabet = 'abcdefghijklmnopqrstuvwxyz';
  String cypher = 'qwertyuiopasdfghjklzxcvbnm';
  int n = alphabet.length;
  print('enter text:');
  String text = stdin.readLineSync(encoding: utf8)!;
  for (int i = 0; i < n; i++) {
    text = text.replaceAll(alphabet[i], cypher[i]);
  }
  print(text);
}
Expected result: abcdef = qwerty
Actual result: jvmkmn
Any ideas why this is happening? I'm a total beginner, by the way.
That's because you first replace the letter a with q, but later, when i = 16, you replace every q with j, including the ones that were originally a. This is why your a ends up as j, and so on for the other letters.
Best of luck to you :)
For the record, the (very direct and) safer approach would be:
import 'dart:convert';
import 'dart:io';

void main(List<String> args) {
  String alphabet = 'abcdefghijklmnopqrstuvwxyz';
  String cypher = 'qwertyuiopasdfghjklzxcvbnm';
  assert(alphabet.length == cypher.length);
  // Pattern matching any character in `alphabet`.
  var re = RegExp('[${RegExp.escape(alphabet)}]');
  print('enter text:');
  String text = stdin.readLineSync(encoding: utf8)!;
  // Replace each character matched by `re` with the corresponding
  // character in `cypher`.
  text = text.replaceAllMapped(re, (m) => cypher[alphabet.indexOf(m[0]!)]);
  print(text);
}
(This is not an efficient approach: it does a linear lookup in the alphabet for each character. A more efficient approach would either recognize that the alphabet is a contiguous range of character codes and use arithmetic to find the position in the alphabet, or, if it weren't a contiguous range, build a lookup table for the alphabet first.)
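The arithmetic variant mentioned above can be sketched as follows. This is written in Java rather than Dart, purely for illustration, and assumes the plain alphabet is exactly the contiguous range a to z. The names SubstitutionCipher and encode are invented for the example.

```java
final class SubstitutionCipher {
    // Cipher alphabet from the question: position k holds the substitute
    // for the k-th letter of 'a'..'z'.
    static final String CIPHER = "qwertyuiopasdfghjklzxcvbnm";

    // Visits each input character exactly once, so an already substituted
    // letter can never be substituted a second time.
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // c - 'a' is the letter's index in the alphabet; other
            // characters are copied through unchanged.
            out.append(c >= 'a' && c <= 'z' ? CIPHER.charAt(c - 'a') : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("abcdef")); // prints qwerty
    }
}
```

Because the output is built separately from the input, the chained-replacement problem from the question cannot occur, and each lookup takes constant time.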
In MATLAB, consider the string:
str = 'text text text [[word1,word2,word3]] text text'
I want to pick one word at random from the list ('word1','word2','word3'), say 'word2', and then write, possibly to a new file, the string:
strnew = 'text text text word2 text text'
My approach is as follows (certainly pretty bad):
Isolating the string '[[word1,word2,word3]]' can be achieved via
str2=regexp(str,'\[\[(.*?)\]\]','match','once')
Removing the opening and closing square brackets in the string is achieved via
str3=str2(3:end-2)
Finally we can split str3 into a list of words (stored in a cell)
ListOfWords = split(str3,',')
which outputs {'word1'}{'word2'}{'word3'} and I am stuck there. How can I pick one of the entries and plug it back into the initial string (or a copy of it...)? Note that the delimiters [[ and ]] could both be changed to || if it can help.
You can do it as follows:
1. Use regexp with the 'split' option;
2. Split the middle part into words;
3. Select a random word;
4. Concatenate back.
str = 'text text text [[word1,word2,word3]] text text'; % input
str_split = regexp(str, '\[\[|\]\]', 'split'); % step 1
list_of_words = split(str_split{2}, ','); % step 2
chosen_word = list_of_words{randi(numel(list_of_words))}; % step 3
strnew = [str_split{1} chosen_word str_split{3}]; % step 4
I have a horrible solution. I was trying to see if I could do it in one function call. You can... but at what cost! Abusing dynamic regular expressions like this barely counts as one function call.
I use a dynamic expression to process the comma separated list. The tricky part is selecting a random element. This is made exceedingly difficult because MATLAB's syntax doesn't support paren indexing off the result of a function call. To get around this, I stick it in a struct so I can dot index. This is terrible.
>> regexprep(str,'\[\[(.*)\]\]',"${struct('tmp',split(string($1),',')).tmp(randi(count($1,',')+1))}")
ans =
'text text text word3 text text'
Luis definitely has the best answer, but I think it could be simplified a smidge by not using regular expressions.
str = 'text text text [[word1,word2,word3]] text text'; % input
tmp = extractBetween(str,"[[","]]"); % step 1
tmp = split(tmp, ','); % step 2
chosen_word = tmp(randi(numel(tmp))) ; % step 3
strnew = replaceBetween(str,"[[","]]",chosen_word,"Boundaries","Inclusive") % step 4
C#:
string mystring = "Hello World. &amp; my name is &lt; bob &gt;. Thank You.";
Console.WriteLine(mystring.ToUpper());
I am trying to get all the text to be uppercase except for:
&amp; &lt; &gt;
These are my encoded entities, and the encoding won't work unless they stay lowercase.
You can split the string on spaces, convert every item that does not start with & to uppercase, keep the rest as is, and then join the items back into a string (Select requires using System.Linq;):
string mystring = "Hello World. &amp; my name is &lt; bob &gt;. Thank You.";
string result = string.Join(" ", mystring.Split(' ').Select(m => m.StartsWith("&") ? m : m.ToUpper()));
Another approach is to use a regex that matches either & followed by one or more word characters and a ; (left unchanged), or any other run of word characters, which is captured in Group 1 and converted to uppercase:
var result = System.Text.RegularExpressions.Regex.Replace(mystring,
    @"&\w+;|(\w+)", m =>
        m.Groups[1].Success ? m.Groups[1].Value.ToUpper() :
        m.Value
);
I want to write an algorithm that removes every word that starts with an uppercase character from a string.
For example:
Original string: "Today is Friday the 29Th."
Desired result: "is the 29Th."
I wrote this algorithm, but it is not complete:
def removeUpperCaseChars(str: String) = {
  for (i <- 0 to str.length - 1) {
    if (str.charAt(i).isUpper) {
      var j = i
      var cont = i
      while (str.charAt(j) != " ") {
        cont += 1
      }
      val subStr = str.substring(0, i) + str.substring(cont, str.length - 1)
      println(subStr)
    }
  }
}
It (supposedly) removes every word with uppercase characters instead of removing only the words that start with uppercase characters. And worse than that, Scala doesn't give any result.
Can anyone help me with this problem?
Under some assumptions, such as words always being separated by whitespace, you can implement it like this:
scala> "Today is Friday the 29Th.".split("\\s+").filterNot(_.head.isUpper).mkString(" ")
res2: String = is the 29Th.
In Scala we don't usually write algorithms the way you did. That is rather how you would do it in C.
How about string.replaceAll("""\b[A-Z]\w+""", "")?
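One caveat with that pattern: \w+ needs at least two characters in total, so a one-letter word such as "I" is not matched, and the whitespace after each removed word stays behind. Here is a hedged sketch of a variant that handles both, written in Java (whose replaceAll takes the same kind of regex). The class and method names are made up for the example, and it assumes ASCII capitals only.

```java
final class CapitalizedWordRemover {
    // Removes every word that starts with an ASCII uppercase letter,
    // together with the whitespace that follows it.
    static String remove(String text) {
        // \b       word boundary, so "29Th" is not matched at its 'T'
        // [A-Z]    the leading capital
        // \w*      rest of the word (may be empty, so "I" matches too)
        // \s*      whitespace following the removed word
        return text.replaceAll("\\b[A-Z]\\w*\\s*", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(remove("Today is Friday the 29Th.")); // prints: is the 29Th.
    }
}
```

The final trim() only matters when the last word of the input is removed and leaves a trailing space.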
I want to remove from a string all words that are not in a list.
For example, I have the string "i like pie and cake" or "pie and cake is good", and I want to remove the words that are not "pie" or "cake" and end up with the string "pie cake".
It would be great if the words that are not deleted could be loaded from a table.
Here's another solution, but you may need to trim the last space in the result.
acceptable = { "pie", "cake" }
for k,v in ipairs(acceptable) do acceptable[v]=v.." " end
setmetatable(acceptable,{__index= function () return "" end})
function strip(s,t)
    s=s.." "
    print('"'..s:gsub("(%a+) %s*",t)..'"')
end
strip("i like pie and cake",acceptable)
strip("pie and cake is good",acceptable)
gsub is the key point here. There are other variations using gsub and a function, instead of setting a metatable for acceptable.
local function stripwords(inputstring, inputtable)
    local retstring = {}
    local itemno = 1;
    for w in string.gmatch(inputstring, "%a+") do
        if inputtable[w] then
            retstring[itemno] = w
            itemno = itemno + 1
        end
    end
    return table.concat(retstring, " ")
end
This works provided that the words you want to keep are all keys of inputtable, for example { pie = true, cake = true }.
The following also implements the last part of the request (I hope):
it would be great, if the words it does not delete could be loaded from a table.
function stripwords(str, words)
    local w = {};
    return str:gsub("([^%s.,!?]+)%s*", function(word)
        if words[word] then return "" end
        w[#w+1] = word
    end), w;
end
Keep in mind that Lua's pattern matcher is not compatible with multibyte strings, which is why I used the pattern above. If you don't care about multibyte strings, you can use something like "(%a+)%s". In that case I would also run the words through string.upper.
Tests / Usage
local blacklist = { some = true, are = true, less = true, politics = true }
print((stripwords("There are some nasty words in here!", blacklist)))
local r, t = stripwords("some more are in politics here!", blacklist);
print(r);
for k,v in pairs(t) do
    print(k, v);
end