How to match 3- and 4-byte UTF-8 characters in a string by regexp

I just want to find a 3-byte character in Go using regexp.
But it panics with
regexp: Compile(\x{E29AA4}): error parsing regexp: invalid escape
sequence: \x{E29AA4
func get_words_from(text string) []string {
words := regexp.MustCompile(`\x{E29AA4}`)
return words.FindAllString(text, -1)
}
func main() {
text := "One,ВАПОЛтлдо⚤two ыаплд⚤ы ыапю.ы./\tавt𒀅hr𓀋ee!"
fmt.Println(get_words_from(text))
}
You can try it on the playground.

Decode the UTF-8 byte sequence E2 9A A4 with e.g. utf8.DecodeRune() and use the resulting rune in the regexp:
func get_words_from(text string) []string {
r, _ := utf8.DecodeRune([]byte{0xE2, 0x9A, 0xA4})
words := regexp.MustCompile(string(r))
return words.FindAllString(text, -1)
}
You may also simply convert the byte slice to string (which interprets it as UTF-8 encoded bytes):
func get_words_from2(text string) []string {
s := string([]byte{0xE2, 0x9A, 0xA4})
words := regexp.MustCompile(s)
return words.FindAllString(text, -1)
}
Or use the equivalent unicode code point (which is 0x26a4) in the regexp string:
func get_words_from3(text string) []string {
words := regexp.MustCompile("\u26a4")
return words.FindAllString(text, -1)
}
Note that "\u26a4" is an interpreted string literal and will be unescaped by the Go compiler (not the regexp package).
All examples return (try the examples on the Go Playground):
[⚤ ⚤]
To filter out all runes that have 3 or more bytes in UTF-8, you may use a for range and utf8.RuneLen():
text := "One,ВАПОЛтлдо⚤two ыаплд⚤ы ыапю.ы./\tавt𒀅hr𓀋ee!"
fmt.Println(text)
var out []rune
for _, r := range text {
if utf8.RuneLen(r) < 3 {
out = append(out, r)
}
}
fmt.Println(string(out))
This outputs (try it on the Go Playground):
One,ВАПОЛтлдо⚤two ыаплд⚤ы ыапю.ы./ авt𒀅hr𓀋ee!
One,ВАПОЛтлдоtwo ыаплды ыапю.ы./ авthree!
Or use strings.Map(), where you return -1 for such runes, which then will be left out in the result:
out := strings.Map(func(r rune) rune {
if utf8.RuneLen(r) < 3 {
return r
}
return -1
}, text)
fmt.Println(string(out))
This outputs the same. Try this one on the Go Playground.

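Also I found that the character ⚤ can be written as "\xE2\x9A\xA4" in an interpreted (double-quoted) Go string literal, because the Go compiler turns those escapes into the three UTF-8 bytes. Inside the regexp syntax itself, \x takes a code point rather than a UTF-8 byte sequence, so the pattern is \x{26A4}; the wrong form \x{E29AA4} is rejected because it exceeds the largest code point, 0x10FFFF.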
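For the question in the title (matching any 3- or 4-byte character with a regexp), a character class over code point ranges may be enough. The following is a minimal sketch, not taken from the answers above: the names wideRunes, maleFemale and findWide are made up for illustration. It relies on the fact that UTF-8 uses 3 bytes for U+0800..U+FFFF and 4 bytes for U+10000..U+10FFFF.

```go
package main

import (
	"fmt"
	"regexp"
)

// wideRunes (illustrative name) matches any single code point whose UTF-8
// encoding takes 3 bytes (U+0800..U+FFFF) or 4 bytes (U+10000..U+10FFFF).
var wideRunes = regexp.MustCompile(`[\x{800}-\x{10FFFF}]`)

// maleFemale (illustrative name) matches exactly U+26A4 (⚤), written as a
// code point escape that the regexp package itself interprets.
var maleFemale = regexp.MustCompile(`\x{26A4}`)

// findWide returns every 3- or 4-byte character found in text.
func findWide(text string) []string {
	return wideRunes.FindAllString(text, -1)
}

func main() {
	text := "One,ВАПОЛтлдо⚤two ыаплд⚤ы ыапю.ы./\tавt𒀅hr𓀋ee!"
	fmt.Println(maleFemale.FindAllString(text, -1)) // [⚤ ⚤]
	fmt.Println(findWide(text))                     // [⚤ ⚤ 𒀅 𓀋]
}
```

The Cyrillic letters need only 2 bytes each in UTF-8, so they are not matched.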

Related

Doing base64 decoding on a string in Go

I have a particular string that I need to run base64 decode on in Go. This string looks something like this:
qU4aaakFmjaaaaI5aaa\/EN\/aaa\/SaaaJaaa6aa+nGnk=
Please note this is not the exact same string but it does have the same shape and number of characters, padding characters and it has those \/ things on the same positions in the string.
Let's call it key.
In PHP if I run
base64_decode($key);
the decode operation is successful.
If in Python I run
base64.b64decode(key)
the decode operation is once more successful. Problem is, I can't do base64 decoding on this thing in Go.
dcd, err := base64.StdEncoding.DecodeString("qU4aaakFmjaaaaI5aaa\\/EN\\/aaa\\/SaaaJaaa6aa+nGnk=")
if err != nil {
log.Fatal(err)
}
return dcd
This will return the error
illegal base64 data at input byte 19
In the Go version, I have to escape those backslashes. It seems that the error appears at byte 19. Bearing in mind that this string that I am using as an example has the same length as the string that is actually causing the problem I would believe that the error happens right at the byte with the \ character. What can I do about this?
The alphabet of the standard Base64 does not contain the backslash. So the qU4aaakFmjaaaaI5aaa\/EN\/aaa\/SaaaJaaa6aa+nGnk= input is not a valid Base64-encoded string.
The forward slash is a valid character in Base64, just not the backslash. It's possible the \/ is a sequence designating a single slash. If so, replace the \/ sequences with a single / and you're good to go.
For example:
s := `qU4aaakFmjaaaaI5aaa\/EN\/aaa\/SaaaJaaa6aa+nGnk=`
s = strings.ReplaceAll(s, `\/`, `/`)
dcd, err := base64.StdEncoding.DecodeString(s)
if err != nil {
log.Fatal(err)
}
fmt.Println(string(dcd))
Which outputs (try it on the Go Playground):
�Ni��6�i�9i����i��i��i��i��y
If \/ is not a special sequence and you want to discard all invalid characters from the input, this is how it could be done:
var valid = map[rune]bool{}
func init() {
for _, r := range "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=" {
valid[r] = true
}
}
func clean(s string) string {
return strings.Map(func(r rune) rune {
if valid[r] {
return r
}
return -1
}, s)
}
func main() {
s := `qU4aaakFmjaaaaI5aaa\/EN\/aaa\/SaaaJaaa6aa+nGnk=`
s = clean(s)
dcd, err := base64.StdEncoding.DecodeString(s)
if err != nil {
log.Fatal(err)
}
fmt.Println(string(dcd))
}
Output is the same. Try this one on the Go Playground.
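One common source of \/ sequences is JSON, where \/ is a legal escape for a forward slash. Whether this particular key came from JSON is an assumption, not something the question states. If it did, a sketch like the following lets encoding/json undo the escaping before decoding; decodeJSONBase64 is a hypothetical helper name.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"log"
)

// decodeJSONBase64 (hypothetical helper) assumes raw is the content of a
// JSON string exactly as it appeared between the quotes, escapes included.
// It re-wraps raw in quotes, lets encoding/json resolve escapes such as \/,
// and then Base64-decodes the result.
func decodeJSONBase64(raw string) ([]byte, error) {
	var s string
	if err := json.Unmarshal([]byte(`"`+raw+`"`), &s); err != nil {
		return nil, err
	}
	return base64.StdEncoding.DecodeString(s)
}

func main() {
	dcd, err := decodeJSONBase64(`qU4aaakFmjaaaaI5aaa\/EN\/aaa\/SaaaJaaa6aa+nGnk=`)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d bytes: %x\n", len(dcd), dcd)
}
```

In practice the cleaner fix is usually to unmarshal the whole JSON document into a struct, so the escaped value never reaches the Base64 decoder.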

How to convert a string to rune?

Here is my code snippet:
var converter = map[rune]rune {//some data}
sample := "⌘こんにちは"
var tmp string
for _, runeValue := range sample {
fmt.Printf("%+q", runeValue)
tmp = fmt.Sprintf("%+q", runeValue)
}
The output of fmt.Printf("%+q", runeValue) is:
'\u2318'
'\u3053'
'\u3093'
'\u306b'
'\u3061'
'\u306f'
These values are literally runes, but since the return type of Sprintf is string, I cannot use them in my map, which is map[rune]rune.
I was wondering how can I convert string to rune, or in other words how can I handle this problem?
A string is not a single rune; it may contain multiple runes. You may use a simple type conversion to convert a string to a []rune containing all its runes, like []rune(sample).
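If the string really does hold just one character, or the quoted text that %+q produced, a sketch along these lines may get the rune back. The helper names firstRune and parseQuotedRune are made up for illustration; utf8.DecodeRuneInString and strconv.Unquote are standard library calls.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode/utf8"
)

// firstRune (illustrative helper) returns the first rune of s. The bool is
// false if s is empty or does not start with valid UTF-8.
func firstRune(s string) (rune, bool) {
	r, size := utf8.DecodeRuneInString(s)
	if r == utf8.RuneError && size <= 1 {
		return 0, false
	}
	return r, true
}

// parseQuotedRune (illustrative helper) turns text such as `'\u2318'`,
// as produced by fmt.Sprintf("%+q", r), back into the rune.
func parseQuotedRune(q string) (rune, error) {
	s, err := strconv.Unquote(q)
	if err != nil {
		return 0, err
	}
	r, _ := utf8.DecodeRuneInString(s)
	return r, nil
}

func main() {
	r, ok := firstRune("⌘こんにちは")
	fmt.Printf("%c %v\n", r, ok)

	q := fmt.Sprintf("%+q", '⌘') // the quoted form from the question
	r2, err := parseQuotedRune(q)
	fmt.Printf("%s -> %c %v\n", q, r2, err)
}
```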
The for range iterates over the runes of a string, so in your example runeValue is of type rune, you may use it in your converter map, e.g.:
var converter = map[rune]rune{}
sample := "⌘こんにちは"
for _, runeValue := range sample {
converter[runeValue] = runeValue
}
fmt.Println(converter)
But since rune is an alias for int32, printing the above converter map will print integer numbers, output will be:
map[8984:8984 12371:12371 12385:12385 12395:12395 12399:12399 12435:12435]
If you want to print characters, use the %c verb of fmt.Printf():
fmt.Printf("%c\n", converter)
Which will output:
map[⌘:⌘ こ:こ ち:ち に:に は:は ん:ん]
Try the examples on the Go Playground.
If you want to replace (switch) certain runes in a string, use the strings.Map() function, for example:
sample := "⌘こんにちは"
result := strings.Map(func(r rune) rune {
if r == '⌘' {
return 'a'
}
if r == 'こ' {
return 'b'
}
return r
}, sample)
fmt.Println(result)
Which outputs (try it on the Go Playground):
abんにちは
If you want the replacements defined by a converter map:
var converter = map[rune]rune{
'⌘': 'a',
'こ': 'b',
}
sample := "⌘こんにちは"
result := strings.Map(func(r rune) rune {
if c, ok := converter[r]; ok {
return c
}
return r
}, sample)
fmt.Println(result)
This outputs the same. Try this one on the Go Playground.
Convert a string to a rune slice:
runeArray := []rune("пример")

Converting unicode to "java

I have a problem with character conversion. It all starts with this string: U+1F618. According to fileformat.info, this string is now (almost) in the HTML Entity (hex) notation.
But I need this character to be converted into a C/C++/Java source code-notation. I really don't know if this is the official name for the notation, but I assume this site to be correct :).
So basically my question is, instead of outputting to the real emoji, how can I get the value \uD83D\uDE18?
package main
import (
"fmt"
"html"
"strconv"
"strings"
)
func main() {
original := "\\U0001f618"
// Hex String
h := strings.ReplaceAll(original, "\\U", "0x")
// Hex to Int
i, _ := strconv.ParseInt(h, 0, 64)
// Unescape the string (HTML Entity -> String).
str := html.UnescapeString(string(i))
// Display the emoji.
fmt.Println(str)
// but I want something like this: \uD83D\uDE18
}
If you have the input as a string, e.g.
s := "\\U0001f618"
You may use strconv.Unquote() to unquote it. Be sure the string you pass to it is quoted (it must be wrapped with backticks or double quotes):
s2, err := strconv.Unquote(`"` + s + `"`)
fmt.Println(s2, err)
This will give you an s2 string that contains your emoji:
😘 <nil>
Java's String is modeled on a char[] that holds UTF-16 code units. A Go string, in memory, is the UTF-8 encoded byte sequence.
To convert a Go string to UTF-16, you may use the unicode/utf16 package of the standard lib. For example utf16.Encode() encodes a series of runes (unicode codepoints) to UTF-16. You get a series of runes from a Go string with a simple type conversion: []rune("some string").
u16 := utf16.Encode([]rune(s2))
fmt.Printf("%X\n", u16)
The above prints the UTF-16 code units in hexadecimal format:
[D83D DE18]
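For reference, the two values are a surrogate pair, and the arithmetic behind them is short. The sketch below computes the pair by hand and compares it with utf16.EncodeRune from the standard library; the function name surrogates is made up for illustration, and it is only meaningful for code points above U+FFFF.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// surrogates (illustrative name) splits a supplementary code point
// (above U+FFFF) into its UTF-16 high and low surrogate by hand.
func surrogates(r rune) (hi, lo uint16) {
	v := r - 0x10000              // 20-bit offset into the supplementary planes
	hi = 0xD800 + uint16(v>>10)   // top 10 bits
	lo = 0xDC00 + uint16(v&0x3FF) // bottom 10 bits
	return hi, lo
}

func main() {
	hi, lo := surrogates(0x1F618)
	fmt.Printf("%X %X\n", hi, lo) // D83D DE18

	// The standard library does the same computation:
	r1, r2 := utf16.EncodeRune(0x1F618)
	fmt.Printf("%X %X\n", r1, r2) // D83D DE18
}
```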
To get the format you want, use this loop:
buf := &strings.Builder{}
for _, v := range u16 {
fmt.Fprintf(buf, "\\u%04X", v) // Java's \u escape needs exactly 4 hex digits
}
fmt.Println(buf.String())
Which outputs:
\uD83D\uDE18
Try the examples on the Go Playground.
You can capture this series of conversions in a function:
func convert(s string) (string, error) {
s2, err := strconv.Unquote(`"` + s + `"`)
if err != nil {
return "", err
}
buf := &strings.Builder{}
for _, v := range utf16.Encode([]rune(s2)) {
fmt.Fprintf(buf, "\\u%04X", v) // Java's \u escape needs exactly 4 hex digits
}
return buf.String(), nil
}
Using it:
fmt.Println(convert("\\U0001f618"))
Which outputs (try it on the Go Playground):
\uD83D\uDE18 <nil>

Golang converting from rune to string

I have the following code, it is supposed to cast a rune into a string and print it. However, I am getting undefined characters when it is printed. I am unable to figure out where the bug is:
package main
import (
"fmt"
"strconv"
"strings"
"text/scanner"
)
func main() {
var b scanner.Scanner
const a = `a`
b.Init(strings.NewReader(a))
c := b.Scan()
fmt.Println(strconv.QuoteRune(c))
}
That's because you used Scanner.Scan() to read a rune, but it does something else. Scanner.Scan() reads tokens (or runes of special tokens), controlled by the Scanner.Mode bitmask, and it returns special constants from the text/scanner package, not the rune it read.
To read a single rune use Scanner.Next() instead:
c := b.Next()
fmt.Println(c, string(c), strconv.QuoteRune(c))
Output:
97 a 'a'
If you just want to convert a single rune to a string, use a simple type conversion. rune is an alias for int32, and the spec says this about converting integer numbers to string:
Converting a signed or unsigned integer value to a string type yields a string containing the UTF-8 representation of the integer.
So:
r := rune('a')
fmt.Println(r, string(r))
Outputs:
97 a
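The conversion above treats the number as a code point, which is easy to confuse with formatting the number as decimal digits. A small sketch of the difference follows; asChar and asDigits are illustrative names, not part of any package.

```go
package main

import (
	"fmt"
	"strconv"
)

// asChar (illustrative name) interprets r as a Unicode code point and
// returns its UTF-8 encoding as a string.
func asChar(r rune) string {
	return string(r)
}

// asDigits (illustrative name) formats the numeric value of r in decimal.
func asDigits(r rune) string {
	return strconv.Itoa(int(r))
}

func main() {
	fmt.Println(asChar(97), asDigits(97))         // a 97
	fmt.Println(asChar(0x26A4), asDigits(0x26A4)) // ⚤ 9892
}
```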
Also to loop over the runes of a string value, you can simply use the for ... range construct:
for i, r := range "abc" {
fmt.Printf("%d - %c (%v)\n", i, r, r)
}
Output:
0 - a (97)
1 - b (98)
2 - c (99)
Or you can simply convert a string value to []rune:
fmt.Println([]rune("abc")) // Output: [97 98 99]
There is also utf8.DecodeRuneInString().
Try the examples on the Go Playground.
Note:
Your original code (using Scanner.Scan()) works like this:
You called Scanner.Init() which sets the Mode (b.Mode) to scanner.GoTokens.
Calling Scanner.Scan() on the input (from "a") returns scanner.Ident because "a" is a valid Go identifier:
c := b.Scan()
if c == scanner.Ident {
fmt.Println("Identifier:", b.TokenText())
}
// Output: "Identifier: a"
I know I'm a bit late to the party but here's a []rune to string function:
func runesToString(runes []rune) (outString string) {
// don't need index so _
for _, v := range runes {
outString += string(v)
}
return
}
yes, there is a named return but I think it's ok in this case as it reduces the number of lines and the function is only short
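The loop above builds the string one rune at a time, which reallocates on every +=. Go can convert a []rune to a string directly, so a sketch of the same helper might be as short as this (the name runesToString is kept from the answer above):

```go
package main

import "fmt"

// runesToString converts a rune slice to a string with the built-in
// conversion, which encodes every code point as UTF-8 in one step.
func runesToString(runes []rune) string {
	return string(runes)
}

func main() {
	runes := []rune{'п', 'р', 'и', 'м', 'е', 'р', ' ', '⚤'}
	fmt.Println(runesToString(runes)) // пример ⚤
}
```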
This simple code works in converting a rune to a string
s := fmt.Sprintf("%c", r) // r is the rune to convert
Since I came to this question searching for rune and string and char, thought this may help newbies like me
// str := "aഐbc"
// testString(str)
func testString(oneString string) {
// string to byte slice - no sweat, just convert it;
// the conversion copies the UTF-8 encoded bytes of the string
var twoByteArr []byte = []byte(oneString)
// string to rune slice - no sweat either;
// the conversion decodes the UTF-8 bytes into code points
var threeRuneSlice []rune = []rune(oneString)
// Hmm! A string is stored as bytes, but it can be viewed either as
// bytes or as runes - yeah - read on
// A rune slice can be converted back to a string -
// no sweat, the code points are re-encoded as UTF-8
var thirdString string = string(threeRuneSlice)
// There is a catch here and that is in printing "characters", using for loop and range
fmt.Println("Chars in oneString")
for i, r := range oneString {
fmt.Printf(" %d %v %c ", i, r, r) // i is a byte offset: 0,1,4,5 here, not 0,1,2,3
// since range decodes one rune per step over strings https://blog.golang.org/strings
}
fmt.Println("\nChars in threeRuneSlice")
for i, r := range threeRuneSlice {
fmt.Printf(" %d %v %c ", i, r, r) // i = 0,1,2,3, perfect!!
// as each rune is an int32 (4 bytes in memory) while byte is uint8;
// in a string, 1 to 4 bytes are used to encode one rune, which
// represents a Unicode code point == the REAL CHARACTER
}
fmt.Println("\nValues in oneString ")
for j := 0; j < len(oneString); j++ {
fmt.Printf(" %d %v ", j, oneString[j]) // No, you cannot get characters if you iterate through a string this way
// as you are going over bytes here - not runes
}
fmt.Println("\nValues in twoByteArr")
for j := 0; j < len(twoByteArr); j++ {
fmt.Printf(" %d=%v ", j, twoByteArr[j]) // == same as above
}
fmt.Printf("\none - %s, two %s, three %s\n", oneString, twoByteArr, thirdString)
}
And some more pointless demo https://play.golang.org/p/tagRBVG8k7V
adapted from https://groups.google.com/g/golang-nuts/c/84GCvDBhpbg/m/Tt6089MPFQAJ
to show that the 'characters' are encoded with one to up to 4 bytes depending on the unicode code point
Here are some simple examples to show quickly how to do it.
// rune => string
fmt.Printf("%c\n", 65) // A
fmt.Println(string(rune(0x1F60A))) // 😊
fmt.Println(string([]rune{0x1F468, 0x200D, 0x1F9B0})) // 👨‍🦰
// string => rune
fmt.Println(strconv.FormatUint(uint64([]rune("😊")[0]), 16)) // 1f60a
fmt.Printf("%U\n", '😊') // U+1F60A
fmt.Printf("%U %U %U\n", '👨', '‍', '🦰') // U+1F468 U+200D U+1F9B0

Convert string to binary in Go

How do you convert a string to its binary representation in Go?
Example:
Input: "A"
Output: "01000001"
In my testing, fmt.Sprintf("%b", 75) only works on integers.
Cast the 1-character string to a byte in order to get its numerical representation.
s := "A"
st := fmt.Sprintf("%08b", byte(s[0]))
fmt.Println(st)
Output: "01000001"
(Bear in mind that "%b" without a width, i.e. without the 08, drops leading zeros from the output.)
You have to iterate over the runes of the string:
func toBinaryRunes(s string) string {
var buffer bytes.Buffer
for _, runeValue := range s {
// no fixed width here: a code point may need up to 21 bits
fmt.Fprintf(&buffer, "%b", runeValue)
}
return buffer.String()
}
Or over the bytes:
func toBinaryBytes(s string) string {
var buffer bytes.Buffer
for i := 0; i < len(s); i++ {
// %08b keeps leading zeros, so every byte yields exactly 8 digits
fmt.Fprintf(&buffer, "%08b", s[i])
}
return buffer.String()
}
Live playground:
http://play.golang.org/p/MXZ1Y17xWa
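Putting the pieces together, here is a minimal sketch of a byte-wise conversion that pads every byte to 8 digits and separates bytes with spaces, plus the reverse direction. The names toBinary and fromBinary are made up for illustration, and the space separator is just one possible choice.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toBinary (illustrative name) renders every byte of s as exactly
// 8 binary digits, with a space between bytes.
func toBinary(s string) string {
	var sb strings.Builder
	for i := 0; i < len(s); i++ {
		if i > 0 {
			sb.WriteByte(' ')
		}
		fmt.Fprintf(&sb, "%08b", s[i])
	}
	return sb.String()
}

// fromBinary (illustrative name) reverses toBinary: it parses
// whitespace-separated groups of binary digits back into bytes.
func fromBinary(b string) (string, error) {
	var out []byte
	for _, field := range strings.Fields(b) {
		n, err := strconv.ParseUint(field, 2, 8)
		if err != nil {
			return "", err
		}
		out = append(out, byte(n))
	}
	return string(out), nil
}

func main() {
	bin := toBinary("Hi")
	fmt.Println(bin) // 01001000 01101001
	back, err := fromBinary(bin)
	fmt.Println(back, err) // Hi <nil>
}
```

Because it works on bytes, a multi-byte character such as ⚤ comes out as three 8-digit groups, and fromBinary reassembles them into the original character.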
