Ignore character accents when sorting strings

I'm writing a Go program that takes a list of strings and sorts them into buckets by the first character of each string. However, I want it to group accented characters with the unaccented character they most resemble. So, if I have a bucket for the letter A, I want strings that start with Á to be included in it.
Does Go have anything built-in for determining this, or is my best bet to just have a large switch statement with all characters and their accented variations?

The supplementary golang.org/x/text packages (not part of the standard library) support this. Here's an example...
package main

import (
	"fmt"

	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
	strs := []string{"abc", "áab", "aaa"}
	cl := collate.New(language.English, collate.Loose)
	cl.SortStrings(strs)
	fmt.Println(strs)
}
outputs:
[aaa áab abc]
Also, check out the following reference on text normalization:
http://blog.golang.org/normalization

Related

Writing Bytes to strings.builder prints nothing

I am learning go and am unsure why this piece of code prints nothing
package main

import (
	"strings"
)

func main() {
	var sb strings.Builder
	sb.WriteByte(byte(127))
	println(sb.String())
}
I would expect it to print 127

You are appending a byte to the string's buffer, not the characters "127".
Go strings are UTF-8 encoded, so a single byte with a value of 127 or less is the ASCII character with that code. As an ASCII chart shows, 127 is the "delete" (DEL) control character. Because DEL is non-printable, println outputs nothing visible.
Doing the same thing as in your question with a printable character, such as 90 for "Z", does print Z.
If you want to append the characters "127" you can use sb.WriteString("127") or sb.Write([]byte("127")). If you want to append the string representation of a byte, you might want to look at using fmt.Sprintf.
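As a minimal sketch of the difference described above, the following program appends the same value once as a raw byte and once as decimal text. The helper names rawByte and decimalText are made up for this illustration, and strconv.Itoa is used here as one alternative to fmt.Sprintf.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rawByte (hypothetical helper) appends b as one raw byte,
// which is what sb.WriteByte does in the question.
func rawByte(b byte) string {
	var sb strings.Builder
	sb.WriteByte(b)
	return sb.String()
}

// decimalText (hypothetical helper) appends the decimal digits of b as text.
func decimalText(b byte) string {
	var sb strings.Builder
	sb.WriteString(strconv.Itoa(int(b)))
	return sb.String()
}

func main() {
	// %q quotes the string so non-printable bytes become visible escapes.
	fmt.Printf("%q\n", rawByte(90))      // "Z"
	fmt.Printf("%q\n", rawByte(127))     // "\x7f"
	fmt.Printf("%q\n", decimalText(127)) // "127"
}
```

Printing with %q is a handy way to confirm that the builder is not empty even when nothing visible shows up on the terminal.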
Note: I'm not an expert on character encoding so apologies if the terminology in this answer is incorrect.

Python3 and combining Diacritics

I've been having a problem with Unicode in python3 and I can't seem to understand why that's happening.
symbol= "ῇ̣"
print(len(symbol))
>>>>2
This letter comes from a word: ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ where I have combining diacritical marks. I want to do the statistical analysis in Python 3 and store the results in a database, the thing is that I also store the character's position (index) in the text. The database-application correctly counts the symbol-variable in the example as one-character, whereas Python counts it as two - throwing off the entire indexing.
The project requires me to keep the diacritics, so I can't simply ignore them or do a .replace("combining diacritical mark","") on the string.
Since Python3 has unicode as default for strings I'm a bit dumbfounded by this.
I have tried to use the base(), strip(), and strip_length() method from Greek-accentuation: https://pypi.org/project/greek-accentuation/ but that's not helping either.
Project requirements are:
Detect the alphabet belonging to the character (OK)
Store string-positions (needed for highlighting in the database) (NotOK)
Be able to process multiple languages/alphabets mixed in one string. (OK)
Iterate over CSV-input. (OK)
Ignore set of predefined strings (OK)
Ignore set of strings that match certain conditions (OK)
This is the simplified code for this project:
# -*- coding: utf-8 -*-
import csv
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
with open("tbltext.csv", "r", encoding="utf8") as txt:
    data = csv.reader(txt)
    for row in data:
        text = row[1]
        ### Here I have some string manipulation (lowering everything, replacing the predefined set of strings by equal-length '-',...)
        ### then I use the ad-module to detect the language by looping over my characters, this is where it goes wrong.
        for letter in text:
            lang = ad.detect_alphabet(letter)
If I use the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ as an example in a for loop, my result is:
>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> for letter in word:
...     print(letter)
...
ἐ
̣
ν
̣
τ
̣
ῇ
̣
[
α
ὐ
τ
]
ῇ
How can I make Python see letters with a combining diacritical mark as one letter instead of making it print the letter and the diacritical mark separately?

The string has a length of 2, which is correct: it contains two code points:
>>> import unicodedata
>>> list(hex(ord(c)) for c in symbol)
['0x1fc7', '0x323']
>>> list(unicodedata.name(c) for c in symbol)
['GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI', 'COMBINING DOT BELOW']
So you should not use len to count the characters.
You could count the characters that are non-combining, so:
>>> import unicodedata
>>> len(''.join(ch for ch in symbol if unicodedata.combining(ch) == 0))
1
From: How do I get the "visible" length of a combining Unicode string in Python? (but I ported it to python3).
This is still not a perfect solution; it depends on what you mean by counting characters. I think it is enough in your case, but fonts can merge characters into ligatures. In some languages those ligatures are visually new, very different characters, unlike ligatures in Western languages.
As a last comment: I think you should normalize strings. With the code above it doesn't matter in this case, but in other cases you may get different results, especially if someone used compatibility characters (e.g. mu for units, or Eszett, instead of the true Greek characters).
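For readers following the Go examples elsewhere on this page, the same idea can be sketched in Go. This is only an approximation under stated assumptions: it treats every rune in the Unicode Mark category as belonging to the preceding character, which is close to but not identical to Python's unicodedata.combining test, and it is not full grapheme segmentation. The function name clusters is made up for this sketch.

```go
package main

import (
	"fmt"
	"unicode"
)

// clusters (hypothetical helper) splits s into base characters, each
// followed by any combining marks (Unicode category M) attached to it.
// A mark with no preceding base character becomes its own element.
func clusters(s string) []string {
	var out []string
	for _, r := range s {
		if unicode.Is(unicode.M, r) && len(out) > 0 {
			out[len(out)-1] += string(r)
			continue
		}
		out = append(out, string(r))
	}
	return out
}

func main() {
	// U+1FC7 (eta with perispomeni and ypogegrammeni) + U+0323 (combining dot below)
	symbol := "\u1fc7\u0323"
	fmt.Println(len([]rune(symbol)))   // 2 code points
	fmt.Println(len(clusters(symbol))) // 1 visible character
}
```

The index of an element in the returned slice then serves as the character position, with each diacritic kept alongside its base letter rather than discarded.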

How to match a part of string before a character into one variable and all after it into another

I have a problem splitting a string into two parts on a special character.
For example:
12345#data
or
1234567#data
The first part has 5-7 characters and is separated by "#" from the second part, which contains other data (characters, numbers, it doesn't matter what).
I need to store two parts on each side of # in two variables:
x = 12345
y = data
without "#" character.
I was looking for some Lua string function like splitOn("#") or substring until character, but I haven't found that.

Use string.match and captures.
Try this:
s = "12345#data"
a,b = s:match("(.+)#(.+)")
print(a,b)

See this documentation:
First of all, although Lua does not have a split function in its standard library, it does have string.gmatch, which can be used instead of a split function in many cases. Unlike a split function, string.gmatch takes a pattern to match the non-delimiter text, instead of the delimiters themselves.
It is easily achievable with the help of a negated character class with string.gmatch:
local example = "12345#data"
for i in string.gmatch(example, "[^#]+") do
    print(i)
end
See IDEONE demo
The [^#]+ pattern matches one or more characters other than #, so it effectively "splits" the string on the single-character delimiter #.

How to detect when bytes can't be converted to string in Go?

There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting []byte to string in Go?

You can, as Tim Cooper noted, test UTF-8 validity with utf8.Valid.
But! You might be thinking that converting non-UTF-8 bytes to a Go string is impossible. In fact, "In Go, a string is in effect a read-only slice of bytes"; it can contain bytes that aren't valid UTF-8 which you can print, access via indexing, pass to WriteString methods, or even round-trip back to a []byte (to Write, say).
There are two places in the language that Go does do UTF-8 decoding of strings for you.
when you do for i, r := range s the r is a Unicode code point as a value of type rune
when you do the conversion []rune(s), Go decodes the whole string to runes.
(Note that rune is an alias for int32, not a completely different type.)
In both these instances invalid UTF-8 is replaced with U+FFFD, the replacement character reserved for uses like this. More is in the spec sections on for statements and conversions between strings and other types. These conversions never crash, so you only need to actively check for UTF-8 validity if it's relevant to your application, like if you can't accept the U+FFFD replacement and need to throw an error on mis-encoded input.
Since that behavior's baked into the language, you can expect it from libraries, too. U+FFFD is utf8.RuneError and returned by functions in utf8.
Here's a sample program showing what Go does with a []byte holding invalid UTF-8:
package main

import "fmt"

func main() {
	a := []byte{0xff}
	s := string(a)
	fmt.Println(s)
	for _, r := range s {
		fmt.Println(r)
	}
	rs := []rune(s)
	fmt.Println(rs)
}
Output will look different in different environments, but in the Playground it looks like
�
65533
[65533]
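If the U+FFFD replacement is not acceptable, the check has to happen before decoding, because input that legitimately contains an encoded U+FFFD is valid UTF-8 and looks the same after decoding. A minimal sketch of such a check with utf8.Valid follows; toString and errInvalidUTF8 are names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for this sketch.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// toString (hypothetical helper) converts b to a string, but returns an
// error instead of silently accepting bytes that are not valid UTF-8.
func toString(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", errInvalidUTF8
	}
	return string(b), nil
}

func main() {
	inputs := [][]byte{[]byte("héllo"), {0xff}}
	for _, in := range inputs {
		s, err := toString(in)
		fmt.Printf("%q %v\n", s, err)
	}
	// Prints:
	// "héllo" <nil>
	// "" input is not valid UTF-8
}
```

For data that is already a string, utf8.ValidString performs the same check without a conversion.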

Make first letter of words uppercase in a string

I have a large array of strings such as this one:
"INTEGRATED ENGINEERING 5 Year (BSC with a Year in Industry)"
I want to capitalise the first letter of the words and make the rest of the words lowercase. So INTEGRATED would become Integrated.
A second spanner in the works - I want an exception to a few words such as and, in, a, with.
So the above example would become:
"Integrated Engineering 5 Year (Bsc with a Year in Industry)"
How would I do this in Go? I can code the loop/arrays to manage the change but the actual string conversion is what I struggle with.

There is a function in the standard library's strings package called Title.
s := "INTEGRATED ENGINEERING 5 Year (BSC with a Year in Industry)"
fmt.Println(strings.Title(strings.ToLower(s)))
https://go.dev/play/p/THsIzD3ZCF9

You can use regular expressions for this task. A \w+ regexp will match all the words, then by using Regexp.ReplaceAllStringFunc you can replace the words with intended content, skipping stop words. In your case, strings.ToLower and strings.Title will be also helpful.
Example:
str := "INTEGRATED ENGINEERING 5 Year (BSC with a Year in Industry)"

// Function replacing words (assuming lower case input)
replace := func(word string) string {
	switch word {
	case "with", "in", "a":
		return word
	}
	return strings.Title(word)
}

r := regexp.MustCompile(`\w+`)
str = r.ReplaceAllStringFunc(strings.ToLower(str), replace)

fmt.Println(str)

// Output:
// Integrated Engineering 5 Year (Bsc with a Year in Industry)
https://play.golang.org/p/uMag7buHG8
You can easily adapt this to your array of strings.

Below is an alternative to the accepted answer, which relies on the now-deprecated strings.Title:
package main

import (
	"fmt"

	"golang.org/x/text/cases"
	"golang.org/x/text/language"
)

func main() {
	msg := "INTEGRATED ENGINEERING 5 Year (BSC with a Year in Industry)"
	fmt.Println(cases.Title(language.English, cases.Compact).String(msg))
}

As of Go 1.18, strings.Title() is deprecated.
Use cases.Title from golang.org/x/text/cases instead.
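If adding a dependency is not an option and the stop-word exception from the question is still needed, one possible standard-library-only sketch is shown below. It is an illustration rather than a replacement for cases.Title: the names titleCase, capitalize, and stopWords are made up here, the stop-word set is taken from the question, and Go's \w only matches ASCII letters, digits, and underscore, so non-ASCII words would need a different pattern.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
	"unicode/utf8"
)

// wordRE matches runs of ASCII word characters.
var wordRE = regexp.MustCompile(`\w+`)

// stopWords holds the words that stay lower case (set assumed from the question).
var stopWords = map[string]bool{"and": true, "in": true, "a": true, "with": true}

// capitalize upper-cases the first rune of word and leaves the rest unchanged.
func capitalize(word string) string {
	if word == "" {
		return word
	}
	r, size := utf8.DecodeRuneInString(word)
	return string(unicode.ToUpper(r)) + word[size:]
}

// titleCase lower-cases s, then capitalizes every word that is not a stop word.
func titleCase(s string) string {
	return wordRE.ReplaceAllStringFunc(strings.ToLower(s), func(w string) string {
		if stopWords[w] {
			return w
		}
		return capitalize(w)
	})
}

func main() {
	s := "INTEGRATED ENGINEERING 5 Year (BSC with a Year in Industry)"
	fmt.Println(titleCase(s))
	// Prints: Integrated Engineering 5 Year (Bsc with a Year in Industry)
}
```

Note that this sketch also leaves a stop word in lower case when it is the first word of the string; whether that is desirable depends on the data.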

Well, you didn't specify the language you're using, so I'll give you a general answer. You have an array with a bunch of strings in it. First I'd make the entire string lower case, then go through each character in the string (capitalize the first one; the rest stay lower case). At this point you need to look for spaces, which divide up the words in each string. The first character after a space starts a new word and should be capitalized. Before capitalizing, you can also check that the next word isn't one of and, in, with, or a.
I'm not at a computer so I can't give you a specific example, but I hope this gets you going in the right direction at least.
