Removing accents and diacritics from a string in Kotlin

Is there any way to convert a string like 'Dziękuję' to 'Dziekuje' or 'šećer' to 'secer' in Kotlin? I have tried using java.text.Normalizer, but it doesn't seem to work the way I want.

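Normalizer only does half the work. Here's how you could use it:
import java.text.Normalizer
private val REGEX_UNACCENT = "\\p{InCombiningDiacriticalMarks}+".toRegex()
fun CharSequence.unaccent(): String {
    // NFD splits each accented letter into a base letter plus combining marks.
    val temp = Normalizer.normalize(this, Normalizer.Form.NFD)
    // Remove the combining marks, leaving only the base letters.
    return REGEX_UNACCENT.replace(temp, "")
}
assert("áéíóů".unaccent() == "aeiou")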
And here's how it works:
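Calling normalize() with Form.NFD decomposes each accented character: if we pass à, the method returns a followed by the combining grave accent (U+0300). The regular expression then removes those combining marks, leaving only the base letters. Characters that have no decomposition are left unchanged, so the result is not guaranteed to be pure US-ASCII.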
Source: http://www.rgagnon.com/javadetails/java-0456.html
Note that Normalizer is a Java class; this is not pure Kotlin and it will only work on JVM.
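As a quick check against the strings from the question, here is a minimal self-contained sketch of the same approach. The helper name stripDiacritics is hypothetical, and the 'Łódź' example is added only to illustrate a limitation: letters without a canonical decomposition, such as ł, pass through unchanged.

```kotlin
import java.text.Normalizer

// Matches characters in the Unicode "Combining Diacritical Marks" block (U+0300..U+036F).
private val COMBINING_MARKS = "\\p{InCombiningDiacriticalMarks}+".toRegex()

// Hypothetical helper: decompose to NFD, then drop the combining marks.
fun stripDiacritics(text: String): String =
    COMBINING_MARKS.replace(Normalizer.normalize(text, Normalizer.Form.NFD), "")

fun main() {
    println(stripDiacritics("Dziękuję")) // Dziekuje
    println(stripDiacritics("šećer"))    // secer
    // 'Ł' (U+0141) has no canonical decomposition, so it is left as-is.
    println(stripDiacritics("Łódź"))     // Łodz
}
```

If letters like ł also need to be mapped, one option is to combine this with an explicit replacement table for those few characters.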

TL;DR:
Use Normalizer to canonically decompose the Unicode text.
Remove non-spacing combining characters (\p{Mn}).
import java.text.Normalizer
fun String.removeNonSpacingMarks() =
    Normalizer.normalize(this, Normalizer.Form.NFD)
        .replace("\\p{Mn}+".toRegex(), "")
Long answer:
Using Normalizer you can transform the original text into an equivalent composed or decomposed form.
NFD: Canonical decomposition.
NFC: Canonical decomposition, followed by canonical composition.
(More information about normalization can be found in Unicode® Standard Annex #15.)
In our case, we are interested in the NFD normalization form because it allows us to separate all the combining characters from the base character.
After decomposing the text, we have to run a regex to remove the combining characters that the decomposition split off.
Combining characters are special characters intended to be positioned relative to an associated base character. The Unicode Standard distinguishes two types of combining characters: spacing and nonspacing.
We are only interested in non-spacing combining characters. Diacritics are the principal class (but not the only one) of this group used with Latin, Greek, and Cyrillic scripts and their relatives.
To remove non-spacing characters with a regex we have to use \p{Mn}. This category covers all non-spacing marks (1,826 characters at the time of writing; the count grows with new Unicode versions).
Other answers use \p{InCombiningDiacriticalMarks}. That block only includes the combining diacritical marks in U+0300 to U+036F; it is a subset of \p{Mn} with only 112 characters.
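To make the difference between the two patterns concrete, here is a small sketch. The function names removeByBlock and removeByCategory are hypothetical, and the test input is my own choice: U+20D7 (COMBINING RIGHT ARROW ABOVE) is a non-spacing mark that lives outside the Combining Diacritical Marks block.

```kotlin
import java.text.Normalizer

// Hypothetical helper: removes only marks from the U+0300..U+036F block.
fun removeByBlock(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFD)
        .replace("\\p{InCombiningDiacriticalMarks}+".toRegex(), "")

// Hypothetical helper: removes every character in general category Mn.
fun removeByCategory(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFD)
        .replace("\\p{Mn}+".toRegex(), "")

fun main() {
    // 'v' followed by U+20D7, which is category Mn but belongs to the
    // "Combining Diacritical Marks for Symbols" block.
    val input = "v\u20D7"
    println(removeByBlock(input).length)    // 2: the mark is still there
    println(removeByCategory(input).length) // 1: only 'v' remains
}
```

For typical Latin-script text both patterns give the same result; the difference only shows up with marks from other blocks.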

This is an extension function you can use and extend further:
fun String.normalize(): String {
    val original = arrayOf("ę", "š")
    val normalized = arrayOf("e", "s")
    return this.map {
        val index = original.indexOf(it.toString())
        if (index >= 0) normalized[index] else it
    }.joinToString("")
}
Use it like this:
val originalText = "aębšc"
val normalizedText = originalText.normalize()
println(normalizedText)
will print
aebsc
Extend the arrays original and normalized with as many elements as you need.
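As the table grows, keeping two parallel arrays in sync gets error-prone. A possible variant, sketched here under the assumption that every replacement is a single character, keeps each pair together in a Map. The names REPLACEMENTS and replaceMapped are hypothetical.

```kotlin
// Hypothetical replacement table; extend it with whatever characters you need.
// Assumes the input uses precomposed characters (e.g. 'ę' as U+0119).
private val REPLACEMENTS: Map<Char, Char> = mapOf('ę' to 'e', 'š' to 's')

// Hypothetical name, chosen to avoid clashing with the normalize() shown above.
fun String.replaceMapped(): String =
    map { REPLACEMENTS[it] ?: it }.joinToString("")

fun main() {
    println("aębšc".replaceMapped()) // aebsc
}
```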

In case anyone is struggling to do this in Kotlin, this code works like a charm.
To avoid inconsistencies I also apply uppercase() and trim() first, and then call this function:
fun stripAccents(s: String?): String {
    if (s == null) return ""
    val sb = StringBuilder(s)
    for (i in s.indices) {
        // These cover my needs; to convert other accents, just add new entries here.
        val replacement = when (s[i]) {
            'Ã', 'Á' -> 'A'
            'Õ', 'Ó' -> 'O'
            'Ç' -> 'C'
            'Ê', 'É' -> 'E'
            'Ú' -> 'U'
            else -> s[i]
        }
        sb.setCharAt(i, replacement)
    }
    return sb.toString()
}
To use this function, call it like this:
var str = editText.text.toString() // get the text from the EditText
str = str.uppercase().trim() // use toUpperCase() on Kotlin versions before 1.5
str = stripAccents(str) // call the function

Related

Kotlin - Way to get a substring starting from a specified index until another specified index or end of string?

Example
val string = "Large mountain"
I would like to get a substring starting from the index of the "t" character until index of "t" + 7 with the 7 being arbitrary or end of string.
val substring = "tain"
Assuming that the string is larger
val string2 = "Large mountain and lake"
I would like to return
val substring2 = "tain and l"
If I were to try substring(indexOf("t"), indexOf("t") + 7) on the first string, "Large mountain", I would get an index out of bounds exception.
I don't think there's an especially elegant way to do this.
One fairly short and readable way is:
val substring = string.drop(string.indexOf('t')).take(7)
This uses indexOf() to locate the first 't' in the string, and then drop() to drop all the previous characters, and take() to take (up to) 7 characters from there.
However, it creates a couple of temporary strings, and will give an IllegalArgumentException if there's no 't' in the string.
Improving robustness and efficiency takes more code, e.g.:
// requires: import kotlin.math.min
val substring = string.indexOf('t').let {
    if (it >= 0)
        string.substring(it, min(it + 7, string.length))
    else
        string
}
That version lets you control the result when there's no 't' (in the else branch); it also avoids creating any temporary objects. As before, it uses indexOf() to locate the first 't', but then min() to work out how long the substring can be, and substring() to generate it in one go.
If you were doing this a lot, you could of course put it into your own function, e.g.:
fun String.substringFrom(char: Char, maxLen: Int) =
    indexOf(char).let {
        if (it >= 0)
            substring(it, min(it + maxLen, length))
        else
            this
    }
which you could then call with e.g. "Large mountain".substringFrom('t', 7)
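For reference, here is a self-contained sketch of that function with a few calls, using the two strings from the question plus one string without a 't' (my own addition) to show the fallback branch.

```kotlin
import kotlin.math.min

// Returns up to maxLen characters starting at the first occurrence of char,
// or the whole string if char does not occur.
fun String.substringFrom(char: Char, maxLen: Int): String =
    indexOf(char).let {
        if (it >= 0) substring(it, min(it + maxLen, length)) else this
    }

fun main() {
    println("Large mountain".substringFrom('t', 7))          // tain
    println("Large mountain and lake".substringFrom('t', 7)) // tain an
    println("Large lake".substringFrom('t', 7))              // Large lake (no 't', so unchanged)
}
```

Returning the whole string when the character is missing is just one choice; returning an empty string or null in the else branch may suit other callers better.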

Substring after a character

I'm looking for a neat way to extract the user id from a string in Groovy. Let's say I have a string of the form "key::${userId}". For example:
String s = "key::123456"
I can extract userId in Java style as follows:
Long.parseLong(s.substring(s.indexOf("::") + 2))
But I believe there is a shorter and neater way.
If key:: is always the prefix, you can use the - operator, combined with the as keyword for the String to long conversion:
String s = 'key::123456'
long userId = (s - 'key::') as long
You can use multiple assignment combined with the tokenize method:
def (_,userId) = "key::123456".tokenize("::")
assert userId == "123456"

How to Split a string with a set of delimiters and find what delimiter it was? Kotlin

I am learning Kotlin and trying to write a calculator that takes an expression like 4+3 or 3*5 and returns the answer. To do that, I want to split the input string and find out which operator was used and what the operands are.
var list = str.split("+","-","*","/" )
How can I also get the delimiter that was used to split the string?
I'm afraid the split method doesn't have this feature. You would have to split the string with a separate split call per delimiter and compare the outcome with the original string: if the string wasn't split by the given delimiter, the first element of the result is the same as the original string.
E.g. like this:
val str = "5+1"
val delimiters = arrayOf("+", "-", "*", "/")
var found = "Not found"
for (delimiter in delimiters) {
    val parts = str.split(delimiter)
    if (parts[0] != str) {
        found = delimiter
        break
    }
}
println(found)
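An alternative that avoids one split call per delimiter is the standard library's findAnyOf, which returns the index and the matched string of the first delimiter it finds. The sketch below is only an illustration: the helper name splitOnOperator is hypothetical, and it assumes a simple expression with a single operator and non-negative operands.

```kotlin
// Hypothetical helper: returns (left operand, operator, right operand),
// or null if the expression contains none of the operators.
fun splitOnOperator(expr: String): Triple<String, String, String>? {
    val (index, op) = expr.findAnyOf(listOf("+", "-", "*", "/")) ?: return null
    val left = expr.substring(0, index).trim()
    val right = expr.substring(index + op.length).trim()
    return Triple(left, op, right)
}

fun main() {
    println(splitOnOperator("4+3")) // (4, +, 3)
    println(splitOnOperator("3*5")) // (3, *, 5)
    println(splitOnOperator("42"))  // null
}
```

An input such as "-4+3" would match the leading minus first, so a real calculator would need proper parsing rather than this shortcut.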

How to manipulate Strings in Scala while using the Play framework?

I am using the play framework 2.2.1 and I have a question concerning the manipulation of Strings within view templates. Unfortunately I am not very familiar with the Scala programming language nor its APIs. The strings are contained in a List which is passed from the controller to the view and then I use a loop to process each string before they are added to the html. I would like to know how to do the following: trim, toLowerCase and remove spaces. As an example, if I have "My string ", I would like to produce "mystring". More specifically I would actually like to produce "myString", however I'm sure I can figure that out if someone points me in the right direction. Thanks.
UPDATE:
Fiaz provided a great solution, building on his answer and just for interest sake I came up with the following solution using recursion. This example is of course making many assumptions about the input provided.
@formatName(name: String) = @{
  def inner(list: List[String], first: Boolean): String = {
    if (!list.tail.isEmpty && first) list.head + inner(list.tail, false)
    else if (!list.tail.isEmpty && !first) list.head.capitalize + inner(list.tail, false)
    else if (list.tail.isEmpty && !first) list.head.capitalize
    else list.head
  }
  if (!name.trim.isEmpty) inner(name.split(' ').map(_.toLowerCase).toList, true)
  else ""
}
If you want to know how to do just the trimming, lower-casing and joining without spaces, try this perhaps?
// Given that s is your string
s.split(" ").map(_.toLowerCase).mkString
That splits the string into an array of strings on each space. Consecutive, leading, or trailing spaces produce at most empty strings, which disappear when the pieces are joined, so the result is effectively trimmed. You then map each element of the array with the function (x => x.toLowerCase) (for which the shorthand is (_.toLowerCase)) and join the array back into a single string using the mkString method that collections have.
So let's say you want to capitalize the first letter of each of the space-split bits:
Scala provides a capitalize method on Strings, so you could use that:
s.split(" ").map(_.toLowerCase.capitalize).mkString
See http://www.scala-lang.org/api/current/scala/collection/immutable/StringOps.html
One suggestion as to how you can get the exact output (your example 'myString') you describe:
(s.split(" ").toList match {
  case fst :: rest => fst.toLowerCase :: rest.map(_.toLowerCase.capitalize)
  case Nil => Nil
}).mkString
Here is an example of string manipulation in a template:
@stringFormat(value: String) = @{
  value.replace("'", "\\'")
}
@optionStringFormat(description: Option[String]) = @{
  if (description.isDefined) {
    description.get.replace("'", "\\'").replace("\n", "").replace("\r", "")
  } else {
    ""
  }
}
@for(photo <- photos) {
<div id="photo" class="random" onclick="fadeInPhoto(@photo.id, '@photo.filename', '@stringFormat(photo.title)', '@optionStringFormat(photo.description)', '@byTags');">
This example was obtained from https://github.com/joakim-ribier/play2-scala-gallery

How to split strings into characters in Scala

For example, there is a string val s = "Test". How do you separate it into t, e, s, t?
Do you need characters?
"Test".toList // Makes a list of characters
"Test".toArray // Makes an array of characters
Do you need bytes?
"Test".getBytes // Java provides this
Do you need strings?
"Test".map(_.toString) // Vector of strings
"Test".sliding(1).toList // List of strings
"Test".sliding(1).toArray // Array of strings
Do you need UTF-32 code points? Okay, that's a tougher one.
def UTF32point(s: String, idx: Int = 0, found: List[Int] = Nil): List[Int] = {
  if (idx >= s.length) found.reverse
  else {
    val point = s.codePointAt(idx)
    UTF32point(s, idx + java.lang.Character.charCount(point), point :: found)
  }
}
UTF32point("Test")
You can use toList as follows:
scala> s.toList
res1: List[Char] = List(T, e, s, t)
If you want an array, you can use toArray
scala> s.toArray
res2: Array[Char] = Array(T, e, s, t)
Actually you don't need to do anything special. Predef already provides an implicit conversion to WrappedString, and WrappedString extends IndexedSeq[Char], so you have all the goodies that are available in it, like:
"Test" foreach println
"Test" map (_ + "!")
Edit
Predef has the augmentString conversion, which has higher priority than wrapString in LowPriorityImplicits. So String ends up being StringLike[String], which is also a Seq of chars.
Additionally, it should be noted that if what you actually want isn't an actual list object, but simply to do something with each character, then Strings can be used as iterable collections of characters in Scala:
for (ch <- "Test") println("_" + ch + "_") // prints each letter on a different line, surrounded by underscores
