I'm trying to reverse a string in Go but I'm having trouble handling the characters. Unlike C, Go treats strings as read-only slices of bytes rather than arrays of characters (characters are called runes here). I tried some type conversions to make the assignments work, but so far I couldn't.
The idea is to generate 5 strings of random characters with lengths 100, 200, 300, 400, and 500, and then reverse each one in place. This was easy in C, but in Go the compiler returns an error saying that the assignment is not possible.
func inverte() {
	var strs, aux string
	rand.Seed(time.Now().UnixNano())
	// Generate 5 strings of 100, 200, 300, 400, and 500 random characters
	for i := 1; i < 6; i++ {
		strs = randomString(i * 100)
		fmt.Print(strs)
		for i2, j := 0, len(strs)-1; i2 < j; i2, j = i2+1, j-1 {
			aux = strs[i2]
			strs[i2] = strs[j] // compile error: cannot assign to strs[i2]
			strs[j] = aux
		}
	}
}
If you want to take into account Unicode combining characters (characters that are intended to modify other characters, like an acute accent ´ + e = é), Andrew Sellers has an interesting take in this gist.
It starts by listing the Unicode block ranges for all combining diacritical marks (CDMs), the blocks containing the most common combining characters:
- the regular (inherited) block, i.e. the usual ◌̀ ◌́ ◌̂ ◌̃ ◌̄ ◌̅ ◌̆ ◌̇ ◌̈, ...;
- extended (containing diacritical marks used in German dialectology -- Teuthonista);
- supplement (for the Uralic Phonetic Alphabet, Medievalist notations, and German dialectology -- again, Teuthonista);
- for symbols (arrows, dots, enclosures, and overlays for modifying symbol characters);
- half marks (diacritic mark parts for spanning multiple characters, as seen here).
var combining = &unicode.RangeTable{
	R16: []unicode.Range16{
		{0x0300, 0x036f, 1}, // combining diacritical marks
		{0x1ab0, 0x1aff, 1}, // combining diacritical marks extended
		{0x1dc0, 0x1dff, 1}, // combining diacritical marks supplement
		{0x20d0, 0x20ff, 1}, // combining diacritical marks for symbols
		{0xfe20, 0xfe2f, 1}, // combining half marks
	},
}
You can then read, rune after rune, your initial string:
sv := []rune(s)
But if you do so in reverse order, you will encounter the combining diacritical marks (CDMs) first, and those need to keep their original order: they must not themselves be reversed.
for ix := len(sv) - 1; ix >= 0; ix-- {
	r := sv[ix]
	if unicode.In(r, combining) {
		cv = append(cv, r)
		fmt.Printf("Detect combining diacritical mark ' %c'\n", r)
	}
(Note the space before the %c when printing a combining rune: '%c' without the space would mean the mark combines with the preceding quote character, giving 'ͤ' instead of ' ͤ '. I tried to use the CGJ, COMBINING GRAPHEME JOINER \u034F, but that does not work.)
When you finally encounter a regular rune, you need to combine it with those accumulated CDMs before adding it to your reversed final rune array.
	} else {
		rrv := make([]rune, 0, len(cv)+1)
		rrv = append(rrv, r)
		rrv = append(rrv, cv...)
		fmt.Printf("regular mark '%c' (with '%d' combining diacritical marks '%s') => '%s'\n",
			r, len(cv), string(cv), string(rrv))
		rv = append(rv, rrv...)
		cv = make([]rune, 0)
	}
}
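Putting the snippets above together, a minimal consolidated sketch of this CDM-aware reverse might look like this (it assumes the combining table defined earlier; the full playground program linked below adds the emoji handling discussed next):

// reverseCombining reverses the runes of s while keeping each combining
// diacritical mark attached to (and after) its base rune.
func reverseCombining(s string) string {
	sv := []rune(s)
	rv := make([]rune, 0, len(sv)) // reversed result
	cv := make([]rune, 0)          // combining marks waiting for their base rune
	for ix := len(sv) - 1; ix >= 0; ix-- {
		r := sv[ix]
		if unicode.In(r, combining) {
			cv = append(cv, r) // hold the mark until we reach its base rune
			continue
		}
		rv = append(rv, r)     // base rune first...
		rv = append(rv, cv...) // ...then its pending marks
		cv = cv[:0]
	}
	return string(rv)
}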
Where it gets even more complex is with emojis and, more recently, modifiers like the Medium-Dark Skin Tone, type 5 on the Fitzpatrick scale of skin tones.
If ignored, Reverse '👩🏾🦰👱🏾🧑🏾⚖️' will give '️⚖🏾🧑🏾👱🦰🏾👩', losing the skin tone on the last two emojis.
And don't get me started on the ZERO WIDTH JOINER (200D) which, per Wisdom/Awesome-Unicode, forces adjacent characters to be joined together (e.g., Arabic characters or supported emoji). It can be used to compose sequentially combined emoji.
Here are two examples of composed emojis whose inner elements must keep their original order when "reversed".
👩🏾🦰 alone is (per a Unicode code-points converter):
👩: woman (1F469)
🏾: dark skin tone (1F3FE)
ZERO WIDTH JOINER (200D)
🦰: red hair (1F9B0)
Those should remain in exactly the same order.
The "character" "judge" (meaning an abstract idea of the semantic value for "judge") can be represented with several glyphs or one glyph.
🧑🏾⚖️ is actually one composed glyph (composed here of two emojis), representing a judge. That sequence should not be inverted.
The program below correctly detect the "zero width joiner" and do not invert the emojis it combines.
It you inspect that emoji, you will find it composed of:
🧑: adult (1F9D1)
🏾: dark skin tone (1F3FE)
ZERO WIDTH JOINER (200D), discussed above
⚖: scale (2696)
VARIATION SELECTOR (FE0F), part of the Unicode combining characters (characters that are intended to modify other characters), here requesting that the 'scale' character be displayed emoji-style (in color) ⚖️, using VS16 (U+FE0F), instead of text-style (monochrome) '⚖', using VS15 (U+FE0E).
Again, that sequence order needs to be preserved.
Note: the actual judge emoji 👨🏾⚖️ uses a MAN 👨 (1F468) instead of an ADULT 🧑 (1F9D1) (plus the other characters listed above: dark skin tone, ZWJ, scale), and is therefore rendered as one glyph instead of a cluster of graphemes.
Meaning: the single-glyph, official emoji for "judge" combines "man" with "scale" (resulting in the one glyph 👨🏾⚖️) instead of "adult" + "scale".
The latter, "adult" + "scale", is still considered "one character": you cannot select just the scale, because of the ZWJ (ZERO WIDTH JOINER).
But that "character" is rendered as a composed glyph 🧑🏾⚖️: two glyphs, each one a concrete written representation of a corresponding grapheme (through codepoint + font).
Obviously, using the first combination ("man" + "scale") results in the more expressive character 👨🏾⚖️.
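As an aside, if you would rather not track ZWJs and variation selectors by hand, a third-party grapheme segmentation package such as github.com/rivo/uniseg (not used by the playground example below; this is just an alternative sketch) can iterate over grapheme clusters directly:

// reverseGraphemes reverses s cluster by cluster, so ZWJ sequences,
// skin-tone modifiers, and combining marks stay intact.
func reverseGraphemes(s string) string {
	var clusters []string
	gr := uniseg.NewGraphemes(s)
	for gr.Next() {
		clusters = append(clusters, gr.Str())
	}
	var sb strings.Builder
	for i := len(clusters) - 1; i >= 0; i-- {
		sb.WriteString(clusters[i])
	}
	return sb.String()
}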
See "The relationship between graphemes and abstract characters for textual representation"
Graphemes and orthographic characters are fairly concrete objects, in the sense that they are familiar to common users—non-experts, who are typically taught to work in terms of them from the time they first learn their “ABCs” (or equivalent from their writing system, of course).
In the domain of information systems, however, we have a different sense of character: abstract characters which are minimal units of textual representation within a given system.
These are, indeed, abstract in two important senses:
first, some of these abstract characters may not correspond to anything concrete in an orthography, as we saw above in the case of HORIZONTAL TAB.
Secondly, the concrete objects of writing (graphemes and orthographic characters) can be represented by abstract characters in more than one way, and not necessarily in a one-to-one manner, as we saw above in the case of “ô” being represented by a sequence <O, CIRCUMFLEX>.
Then: "From grapheme to codepoint to glyph":
Graphemes are the units in terms of which users are usually accustomed to thinking.
Within the computer, however, processes are done in terms of characters.
We don’t make any direct connection between graphemes and glyphs.
As we have defined these two notions here, there is no direct connection between them. They can only be related indirectly through the abstract characters.
This is a key point to grasp: the abstract characters are the element in common through which the others relate.
Full example in Go playground.
Reverse 'Hello, World' => 'dlroW ,olleH'
Reverse '👽👶⃠🎃' => '🎃👶⃠👽'
Reverse '👩🏾🦰👱🏾🧑🏾⚖️' => '🧑🏾⚖️👱🏾👩🏾🦰'
Reverse 'aͤoͧiͤ š́ž́ʟ́' => 'ʟ́ž́š́ iͤoͧaͤ'
Reverse 'H̙̖ell͔o̙̟͚͎̗̹̬ ̯W̖͝ǫ̬̞̜rḷ̦̣̪d̰̲̗͈' => 'd̰̲̗͈ḷ̦̣̪rǫ̬̞̜W̖͝ ̯o̙̟͚͎̗̹̬l͔leH̙̖'
As you correctly identified, Go strings are immutable, so you cannot assign to rune/character values at given indices.
Instead of reversing the string in place, create a copy of its runes, reverse those, and then return the resulting string.
For example (Go Playground):
package main

import "fmt"

func reverse(s string) string {
	rs := []rune(s)
	for i, j := 0, len(rs)-1; i < j; i, j = i+1, j-1 {
		rs[i], rs[j] = rs[j], rs[i]
	}
	return string(rs)
}

func main() {
	fmt.Println(reverse("Hello, World!"))
	// !dlroW ,olleH
	fmt.Println(reverse("Hello, 世界!"))
	// !界世 ,olleH
}
There are problems with this approach due to the intricacies of Unicode (e.g. combining diacritical marks) but this will get you started.
So I'm writing simple parsers for some programming languages in SWI-Prolog using Definite Clause Grammars. The goal is to return true if the input string or file is valid for the language in question, or false if the input string or file is not valid.
In almost all of the languages there is an "identifier" predicate. In most of the languages the identifier is defined as one of the following in EBNF: letter { letter | digit } or ( letter | digit ) { letter | digit }, that is to say, in the first case a letter followed by zero or more alphanumeric characters, or in the second case an alphanumeric character followed by zero or more alphanumeric characters.
My input file is split into a list of word strings (i.e. someIdentifier1 = 3 becomes the list [someIdentifier1,=,3]). The reason for the string to be split into lists of words rather than lists of letters is for recognizing keywords defined as terminals.
How do I implement "identifier" so that it recognizes any alphanumeric string, or a string consisting of a letter followed by alphanumeric characters?
Is it possible or necessary to further split the word into letters for this particular predicate only, and if so how would I go about doing this? Or is there another solution, perhaps using SWI-Prolog libraries' built-in predicates?
I apologize for the poorly worded title of this question; however, I am unable to clarify it any further.
First, when you need to reason about individual letters, it is typically most convenient to reason about lists of characters.
In Prolog, you can easily convert atoms to characters with atom_chars/2.
For example:
?- atom_chars(identifier10, Cs).
Cs = [i, d, e, n, t, i, f, i, e, r, '1', '0'].
Once you have such characters, you can use predicates like char_type/2 to reason about properties of each character.
For example:
?- char_type(i, T).
T = alnum ;
T = alpha ;
T = csym ;
etc.
The general pattern to express identifiers such as yours with DCGs can look as follows:
identifier -->
    [L],
    { letter(L) },
    identifier_rest.

identifier_rest --> [].
identifier_rest -->
    [I],
    { letter_or_digit(I) },
    identifier_rest.
You can use this as a building block, and only need to define letter/1 and letter_or_digit/1. This is very easy with char_type/2.
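For example, a minimal sketch of those two predicates, using the char_type/2 categories shown earlier (alpha for letters, alnum for letters and digits):

letter(C) :- char_type(C, alpha).
letter_or_digit(C) :- char_type(C, alnum).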
Further, you can of course introduce an argument to relate such lists to atoms.
What is an efficient way to produce phrase anagrams given a string?
The problem I am trying to solve
Assume you have a word list with n words. Given an input string, say, "peanutbutter", produce all phrase anagrams. Some contenders are: pea nut butter, A But Ten Erupt, etc.
My solution
I have a trie that contains all words in the given word list. Given an input string, I calculate all permutations of it. For each permutation, I have a recursive solution (something like this) to determine if that specific permuted string can be broken into words. For example, if one of the permutations of peanutbutter was "abuttenerupt", I used this method to break it into "a but ten erupt". I use the trie to determine if a string is a valid word.
What sucks
My problem is that because I calculate all permutations, my solution runs very slowly for phrases longer than 10 characters, which is a big letdown. I want to know if there is a way to do this differently.
Websites like https://wordsmith.org/anagram/ can do the job in less than a second and I am curious to know how they do it.
Your problem can be decomposed into 2 sub-problems:
Find combinations of words that use up all characters of the input string
Find all permutations of the words found in the first sub-problem
Sub-problem #2 is a basic algorithm and you can find existing standard implementations in most programming languages. Let's focus on sub-problem #1.
First convert the input string to a "character pool". We can implement the character pool as an array oc, where oc[c] = the number of occurrences of character c.
Then we use a backtracking algorithm to find words that fit in the pool, as in this pseudo-code:
result = empty;

function findAnagram(pool) {
    if (pool empty) then print result;
    for (word in dictionary) {
        if (word fits in pool) {
            result = result + word;
            update pool to exclude characters in word;
            findAnagram(pool);
            // as with any backtracking algorithm, we have to restore global states
            restore pool;
            restore result;
        }
    }
}
Note: if we passed the pool by value then we wouldn't have to restore it. But as it is quite big, I prefer passing it by reference.
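For concreteness, a small Go sketch of the character pool and the "word fits" check described above (the helper names buildPool and fits are mine; it assumes a lowercase 'a'..'z' character set):

// buildPool counts the occurrences of each lowercase letter in s.
func buildPool(s string) [26]int {
	var oc [26]int
	for _, c := range s {
		oc[c-'a']++
	}
	return oc
}

// fits reports whether word can be spelled using only the characters
// still available in pool.
func fits(word string, pool [26]int) bool {
	var need [26]int
	for _, c := range word {
		need[c-'a']++
		if need[c-'a'] > pool[c-'a'] {
			return false
		}
	}
	return true
}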
Now we remove redundant results and apply some optimizations:
Assume A comes before B in the dictionary. If we choose B as the first word, then we don't have to consider A in the following steps, because any result containing A would already have been generated in the branch where A was chosen as the first word.
If the character set is small enough (< 64 characters is best), we can use a bitmask to quickly filter out words that cannot fit in the pool. A bitmask records which characters occur in a word, no matter how many times they occur.
Update the pseudo-code to reflect those optimizations:
function findAnagram(pool, minDictionaryIndex) {
    pool_bitmask <- bitmask(pool);
    if (pool empty) then print result;
    for (word in dictionary AND word's index >= minDictionaryIndex) {
        // the bitmask of every word in the dictionary should be pre-calculated
        word_bitmask <- bitmask(word);
        if (word_bitmask contains bit(s) that are not in pool_bitmask)
            then skip this iteration;
        if (word fits in pool) {
            result = result + word;
            update pool to exclude characters in word;
            findAnagram(pool, word's index);
            // as with any backtracking algorithm, we have to restore global states
            restore pool;
            restore result;
        }
    }
}
My C++ implementation of subproblem #1 where the character set contains only lowercase 'a'..'z': http://ideone.com/vf7Rpl .
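And a quick Go sketch of the bitmask filter just described (same lowercase 'a'..'z' assumption):

// bitmask sets bit (c - 'a') for every character c occurring in s,
// regardless of how many times it occurs.
func bitmask(s string) uint32 {
	var m uint32
	for _, c := range s {
		m |= 1 << uint(c-'a')
	}
	return m
}

A word can only fit in the pool if its mask introduces no new bits, i.e. bitmask(word) &^ poolMask == 0.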
Instead of a two-stage solution where you generate permutations and then try to break them into words, you could speed things up by checking for valid words as you recursively generate the permutations. If at any point your current partially-complete permutation does not correspond to any valid words, stop there and do not recurse any further. This means you don't waste time generating useless permutations. For example, if you generate "tt", there is no need to permute "peanubuter" and append all the permutations to "tt" because there are no English words beginning with tt.
Suppose you are doing basic recursive permutation generation, keeping track of the current partial word you have generated. If at any point it is a valid word, you can output a space and start a new word, and recursively permute the remaining characters. You can also try adding each of the remaining characters to the current partial word, and only recurse if doing so results in a valid partial word (i.e. a word exists starting with those characters).
Something like this (pseudo-code):
void generateAnagrams(String partialAnagram, String currentWord, String remainingChars)
{
// at each point, you can either output a space, or each of the remaining chars:
// if the current word is a complete valid word, you can output a space
if(isValidWord(currentWord))
{
// if there are no more remaining chars, output the anagram:
if(remainingChars.length == 0)
{
outputAnagram(partialAnagram);
}
else
{
// output a space and start a new word
generateAnagrams(partialAnagram + " ", "", remainingChars);
}
}
// for each of the chars in remainingChars, check if it can be
// added to currentWord, to produce a valid partial word (i.e.
// there is at least 1 word starting with these characters)
for(i = 0 to remainingChars.length - 1)
{
char c = remainingChars[i];
if(isValidPartialWord(currentWord + c))
{
generateAnagrams(partialAnagram + c, currentWord + c,
remainingChars.remove(i));
}
}
}
You could call it like this:
generateAnagrams("", "", "peanutbutter");
You could optimize this algorithm further by passing the trie node corresponding to the current partially completed word, as well as passing currentWord as a string. This would make your isValidPartialWord check even faster.
You can enforce uniqueness by changing your isValidWord check to only return true if the word is in ascending (greater-or-equal) alphabetical order compared to the previous word output. You might also need another check for duplicates at the end, to catch cases where two of the same word can be output.
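For illustration, here is a minimal Go sketch of a trie supporting the isValidWord and isValidPartialWord checks used above (the type and function names are mine):

// node is a minimal trie node keyed by byte.
type node struct {
	children map[byte]*node
	isWord   bool
}

// insert adds word to the trie rooted at root.
func insert(root *node, word string) {
	n := root
	for i := 0; i < len(word); i++ {
		if n.children == nil {
			n.children = map[byte]*node{}
		}
		next := n.children[word[i]]
		if next == nil {
			next = &node{}
			n.children[word[i]] = next
		}
		n = next
	}
	n.isWord = true
}

// lookup walks the trie along prefix; nil means no word starts with prefix.
func lookup(root *node, prefix string) *node {
	n := root
	for i := 0; i < len(prefix) && n != nil; i++ {
		n = n.children[prefix[i]]
	}
	return n
}

// isValidPartialWord: at least one dictionary word starts with s.
func isValidPartialWord(root *node, s string) bool { return lookup(root, s) != nil }

// isValidWord: s itself is a complete dictionary word.
func isValidWord(root *node, s string) bool {
	n := lookup(root, s)
	return n != nil && n.isWord
}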
I believe there are no LeftStr(str, n) (take at most the n first characters), RightStr(str, n) (take at most the n last characters) and SubStr(str, pos, n) (take the first n characters after pos) functions in Go, so I tried to make my own:
// take at most n first characters
func Left(str string, num int) string {
if num <= 0 {
return ``
}
if num > len(str) {
num = len(str)
}
return str[:num]
}
// take at most last n characters
func Right(str string, num int) string {
if num <= 0 {
return ``
}
max := len(str)
if num > max {
num = max
}
num = max - num
return str[num:]
}
But I believe those functions will give incorrect output when the string contains Unicode characters. What's the fastest solution for those functions? Is a for range loop the only way?
As already mentioned in the comments,
combining characters, modifying runes, and other multi-rune
"characters"
can cause difficulties.
Anyone interested in Unicode handling in Go should probably read the Go Blog articles
"Strings, bytes, runes and characters in Go"
and "Text normalization in Go".
In particular, the latter talks about the golang.org/x/text/unicode/norm package, which can help in handling some of this.
You can consider several increasingly accurate (or increasingly Unicode-aware) levels of splitting the first (or last) "n characters" from a string.
1. Just use n bytes.
This may split in the middle of a rune, but it is O(1), very simple, and in many cases you know the input consists of only single-byte runes.
E.g. str[:n].
2. Split after n runes.
This may split in the middle of a character. It can be done easily, but at the expense of copying and converting, with just string([]rune(str)[:n]).
You can avoid the conversion and copying by using the unicode/utf8 package's DecodeRuneInString (and DecodeLastRuneInString) functions to get the length of each of the first n runes in turn and then return str[:sum] (O(n), no allocation); see the sketch after this list.
3. Split after the n'th "boundary".
One way to do this is to use norm.NFC.FirstBoundaryInString(str) repeatedly, or norm.Iter, to find the byte position to split at, and then return str[:pos].
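For instance, a sketch of option 2 without the intermediate []rune conversion (the name leftRunes is mine; it uses unicode/utf8 as described above):

// leftRunes returns at most the n leading runes of s, advancing one
// decoded rune at a time instead of converting the whole string.
func leftRunes(s string, n int) string {
	pos := 0
	for i := 0; i < n && pos < len(s); i++ {
		_, w := utf8.DecodeRuneInString(s[pos:])
		pos += w
	}
	return s[:pos]
}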
Consider the displayed string "cafés", which could be represented in Go code as "cafés", "caf\u00E9s", or "caf\xc3\xa9s", all of which result in the identical six bytes. Alternatively, it could be represented as "cafe\u0301s" or "cafe\xcc\x81s", both of which result in the identical seven bytes.
The first "method" above may split those into "caf\xc3"+"\xa9s" and "cafe\xcc"+"\x81s".
The second may split them into "caf\u00E9"+"s" ("café"+"s") and "cafe"+"\u0301s" ("cafe"+"́s").
The third should split them into "caf\u00E9"+"s" and "cafe\u0301"+"s" (both shown as "café"+"s").
How does one detect if a string is binary safe or not in Go?
A function like:
IsBinarySafe(str) // returns true if it's safe and false if it's not
Any comments after this are just things I have thought about or attempted in order to solve this:
I assumed that there must exist a library that already does this but had a tough time finding it. If there isn't one, how do you implement this?
I thought of a few solutions but wasn't really convinced any of them were good.
One of them was to iterate over the bytes and keep a hash map of all the illegal byte sequences.
I also thought of maybe writing a regex with all the illegal strings but wasn't sure if that was a good solution.
I was also not sure whether a sequence of bytes from other languages counted as binary safe. Take the typical Go example:
世界
Would:
IsBinarySafe("世界") // true or false?
Would it return true or false? I was assuming that a binary-safe string should only use 1-byte characters, so I iterated over the string in the following way:
const nihongo = "日本語abc日本語"
for i, w := 0, 0; i < len(nihongo); i += w {
runeValue, width := utf8.DecodeRuneInString(nihongo[i:])
fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
w = width
}
and returned false whenever the width was greater than 1. These are just some ideas I had in case there isn't already a library for something like this, but I wasn't sure.
Binary safety has nothing to do with how wide a character is; it's more or less about checking for non-printable characters, like null bytes and such.
From Wikipedia:
Binary-safe is a computer programming term mainly used in connection
with string manipulating functions. A binary-safe function is
essentially one that treats its input as a raw stream of data without
any specific format. It should thus work with all 256 possible values
that a character can take (assuming 8-bit characters).
I'm not sure what your goal is; almost all languages handle UTF-8/16 just fine now. However, for your specific question there's a rather simple solution:
package main

import (
	"fmt"
	"unicode"
)

// IsAsciiPrintable checks if s is ASCII and printable, i.e. it doesn't
// include tab, backspace, etc.
func IsAsciiPrintable(s string) bool {
	for _, r := range s {
		if r > unicode.MaxASCII || !unicode.IsPrint(r) {
			return false
		}
	}
	return true
}

func main() {
	s := "日本語abc日本語" // sample input; the playground version defines its own s
	fmt.Printf("len([]rune(s)) = %d, len([]byte(s)) = %d\n", len([]rune(s)), len([]byte(s)))
	fmt.Println(IsAsciiPrintable(s), IsAsciiPrintable("test"))
}
playground
From unicode.IsPrint:
IsPrint reports whether the rune is defined as printable by Go. Such
characters include letters, marks, numbers, punctuation, symbols, and
the ASCII space character, from categories L, M, N, P, S and the ASCII
space character. This categorization is the same as IsGraphic except
that the only spacing character is ASCII space, U+0020.
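If you want to accept any printable Unicode instead of just ASCII (so that a string like 世界 passes), a sketch of a looser variant based on the same unicode.IsPrint check could be:

// IsPrintable is like IsAsciiPrintable above, but accepts any rune that
// Go defines as printable, so IsPrintable("世界") returns true.
func IsPrintable(s string) bool {
	for _, r := range s {
		if !unicode.IsPrint(r) {
			return false
		}
	}
	return true
}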
I would also like to know which algorithm has the best worst-case complexity for finding all occurrences of a string in another. It seems like the Boyer–Moore algorithm has linear time complexity.
The KMP algorithm has linear complexity for finding all occurrences of a pattern in a string, like the Boyer-Moore algorithm¹. If you try to find a pattern like "aaaaaa" in a string like "aaaaaaaaa", once you have the first complete match,
aaaaaaaaa
aaaaaa
 aaaaaa
      ^
the border table contains the information that the next-longest possible match of a prefix of the pattern (corresponding to the widest border of the pattern) is just one character shorter (a complete match is equivalent to a mismatch one past the end of the pattern in this respect). Thus the pattern is moved one place further, and since the border table says that all characters of the pattern except possibly the last one match, the next comparison is between the last pattern character and the aligned text character. In this particular case (finding occurrences of a^m in a^n), which is the worst case for the naive matching algorithm, the KMP algorithm compares each text character exactly once.
In each step, at least one of
the position of the text character compared
the position of the first character of the pattern with respect to the text
increases, and neither ever decreases. The position of the text character compared can increase at most length(text)-1 times, the position of the first pattern character can increase at most length(text) - length(pattern) times, so the algorithm takes at most 2*length(text) - length(pattern) - 1 steps.
The preprocessing (construction of the border table) takes at most 2*length(pattern) steps, thus the overall complexity is O(m+n), and no more than m + 2*n steps are executed if m is the length of the pattern and n the length of the text.
¹ Note that the Boyer-Moore algorithm as commonly presented has a worst-case complexity of O(m*n) for periodic patterns and texts like a^m and a^n if all matches are required, because after a complete match,
aaaaaaaaa
aaaaaa
 aaaaaa
      ^
 <- <-
 ^
the entire pattern would be re-compared. To avoid that, you need to remember how long a prefix of the pattern still matches after the shift following a complete match and only compare the new characters.
There is a long article on KMP at http://en.wikipedia.org/wiki/Knuth-morris-pratt which ends by saying:
Since the two portions of the algorithm have, respectively, complexities of O(k) and O(n), the complexity of the overall algorithm is O(n + k).
These complexities are the same, no matter how many repetitive patterns are in W or S.
(end quote)
So the total cost of a KMP search is linear in the number of characters of string and pattern. I think this holds even if you need to find multiple occurrences of the pattern in the string; if not, just consider searching for patternQ, where Q is a character that does not occur in the text, and noting down where the KMP state shows that it has matched everything up to the Q.
You can compute the prefix (π) function of a string in O(length). KMP builds a special string of length n+m+1 (pattern + separator + text) and computes the π function on it, so in any case the complexity will be O(n+m+1) = O(n+m).
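A short Go sketch of that variant (the name kmpAll and the '\x00' separator are my choices; the separator only needs to be absent from both strings):

// kmpAll returns the start indices of all occurrences of pattern in text,
// by computing the prefix function of pattern + separator + text.
func kmpAll(text, pattern string) []int {
	s := pattern + "\x00" + text
	pi := make([]int, len(s))
	for i := 1; i < len(s); i++ {
		k := pi[i-1]
		for k > 0 && s[i] != s[k] {
			k = pi[k-1] // fall back to the next-widest border
		}
		if s[i] == s[k] {
			k++
		}
		pi[i] = k
	}
	m := len(pattern)
	var matches []int
	for i := 2 * m; i < len(s); i++ {
		if pi[i] == m { // a border of length m here means a full match in text
			matches = append(matches, i-2*m)
		}
	}
	return matches
}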
If you think about it, the worst case for matching the pattern is the one in which you have to visit each index of the LPS array when a mismatch occurs. For example, the pattern "aaaa", which produces the LPS array [0,1,2,3], makes this possible.
Now, for the worst-case matching in the text, we want to maximize such mismatches, forcing us to visit all the indices of the LPS array. That would be a text of repeated near-copies of the pattern, each with the last character a mismatch, for example "aaabaaacaaabaaacaaabaaac".
Let the length of the text be n and that of the pattern be m. The number of occurrences of such a pattern in the text is n/m, and for each of these occurrences we perform m comparisons. Not to forget that we are also traversing the n characters of the text.
Therefore, the worst-case time for KMP matching would be O(n + (n/m)*m), which is basically O(n).
The total worst-case time complexity, including LPS creation, would be O(n+m).
KMP Code (for reference):
// Builds the LPS ("longest proper prefix which is also a suffix") table.
void createLPS(char[] pattern, int[] lps) {
    int m = pattern.length;
    int i = 1;
    int j = 0;
    lps[j] = 0;
    while (i < m) {
        if (pattern[j] == pattern[i]) {
            lps[i] = j + 1;
            i++;
            j++;
        } else {
            if (j != 0) {
                j = lps[j - 1]; // fall back to the next-widest border
            } else {
                lps[i] = 0;
                i++;
            }
        }
    }
}

// Returns the start indices of all occurrences of pattern in str.
List<Integer> match(char[] str, char[] pattern, int[] lps) {
    int m = pattern.length;
    int n = str.length;
    int i = 0, j = 0;
    List<Integer> idxs = new ArrayList<>();
    while (i < n) {
        if (pattern[j] == str[i]) {
            j++;
            i++;
        } else {
            if (j != 0) {
                j = lps[j - 1];
            } else {
                i++;
            }
        }
        if (j == m) {        // full match ending at i - 1
            idxs.add(i - m); // record its start index
            j = lps[j - 1];  // continue, allowing overlapping matches
        }
    }
    return idxs;
}