Creating a substring in Go creates a new kind of symbol - string

I am comparing strings and there is the following:
Please note that the quotation marks in front of NEW are different.
Now when calling my function like this:
my_func(a[18:], b[18:])
The resulting strings are surprisingly:
What do I have to do to cut this weird symbol away and why is it behaving like this?

Because that type of quote is a multibyte character, and you are slicing the string in the middle of a character. Indexing and slicing a Go string work on bytes, not on characters. What you could do is convert to a []rune, slice that, and then convert back:
https://play.golang.org/p/pw42sEwRTZd
s := "H界llo"
fmt.Println(s[1:3]) // ��
fmt.Println(string([]rune(s)[1:3])) // 界l

Another option is the utf8string package:
package main
import "golang.org/x/exp/utf8string"
func main() {
s := utf8string.NewString(` 'Not Available') “NEW CREDIT" FROM customers;`)
t := s.Slice(18, s.RuneCount())
println(t == `“NEW CREDIT" FROM customers;`)
}
https://pkg.go.dev/golang.org/x/exp/utf8string
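If you would rather not depend on golang.org/x/exp, the same slicing can be sketched with the standard library only. This is a minimal sketch, not the only way to do it; fromRune is a hypothetical helper name, and the index 18 is a rune index taken from the utf8string example above, not a byte offset.

```go
package main

import "fmt"

// fromRune returns the part of s that starts at the n-th rune (0-based).
// It returns "" when s has n runes or fewer.
// Hypothetical helper for illustration; not part of any library.
func fromRune(s string, n int) string {
	if n <= 0 {
		return s
	}
	count := 0
	for i := range s { // i is the byte offset at which each rune starts
		if count == n {
			return s[i:] // slicing at a rune boundary never splits a character
		}
		count++
	}
	return ""
}

func main() {
	a := ` 'Not Available') “NEW CREDIT" FROM customers;`
	fmt.Println(fromRune(a, 18)) // “NEW CREDIT" FROM customers;
}
```

Because range over a string yields the byte offset of each rune, the cut always lands on a character boundary, and the string is not copied into a []rune first.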

Related

Go flag package: option string containing escaped runes like "\u00FC" is not read correctly

The test program below works as desired when it uses the DEFAULT string, which contains code points like \u00FC,
and also when that type of code point is hard-coded as a string within the program. Passing the same string from the command line, as in prog.exe -input="ABC\u00FC", does NOT work. I assumed it was an OS interaction, so
I tried other quoting, even wrapping like "(ABC\u00FC)" and trimming the parentheses inside the func, but that was no good either.
Is the "for _, runeRead := range []rune" incorrect for escaped values?
package main
import (
"fmt"
"flag"
"os"
)
var input string
var m = make(map[rune]struct{})
func init() {
flag.StringVar(&input, "input", "A7\u00FC", "string of runes")
m['A'] = struct{}{}
m['\u00FC'] = struct{}{}
m['7'] = struct{}{}
}
func main() {
flag.Parse()
ck(input) // cmd line - with default OK
ck("A\u00FC") // hard code - OK
}
func ck(in string) {
for _, runeRead := range []rune(in) {
fmt.Printf("DEBUG: Testing rune: %v %v\n", string(runeRead), byte(runeRead))
if _, ok := m[runeRead]; !ok {
fmt.Printf("\nERROR: Invalid entry <%v>, in string <%s>.\n", string(runeRead), in)
os.Exit(9)
}
}
}
The solution needs to work on Windows and Linux.
https://ss64.com/nt/syntax-esc.html
^ Escape character.
Adding the escape character before a command symbol allows it to be treated as ordinary text.
When piping or redirecting any of these characters you should prefix with the escape character: & \ < > ^ |
e.g. ^\ ^& ^| ^> ^< ^^
So you should do
prog.exe -input="ABC^\u00FC"
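Independently of shell quoting, note that the flag package hands the program the argument text exactly as the shell delivered it. The \u00FC escape is only turned into ü by the Go compiler when it appears in an interpreted string literal in source code, which is why the default value and the hard-coded string work. A minimal sketch of interpreting the escapes at run time, assuming the shell passes the backslash through unchanged; unescape is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strconv"
)

// unescape interprets Go-style backslash escapes (such as \u00FC) in raw.
// Hypothetical helper for illustration; it is not part of the flag package.
func unescape(raw string) (string, error) {
	// strconv.Unquote expects a quoted Go literal, so wrap raw in double quotes.
	// Input that itself contains an unescaped double quote is rejected with an error.
	return strconv.Unquote(`"` + raw + `"`)
}

func main() {
	// The raw literal stands in for what -input receives from the command line.
	s, err := unescape(`ABC\u00FC`)
	if err != nil {
		fmt.Println("invalid escape:", err)
		return
	}
	fmt.Println(s) // ABCü
}
```

In the program from the question, this would mean calling such a helper on input after flag.Parse() and before ck(input).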
In case it helps others:
Apparently different OSs and/or shells (in my case bash) have an issue with the "\u" of the Unicode escape. In bash, at the command line, the user could wrap the characters in $'...' so that the \u escape is handled by the shell. It was suggested that WITHIN the program, if a string had the same issue, strconv.Quote could have been a solution.
Since I wanted an OS/shell-independent solution for non-computer-savvy users, I did a slightly more involved workaround.
I tell users who need a character that would require the \u format to enter %FC instead of \u00FC. I parse the string from the command line, e.g. ABC%FC%F6123, with regexp, and inside my Go code I replace each %xx with the Unicode rune as I had originally expected to get it. With a few lines of code the user input is now OS agnostic.
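The answer does not show its code, so the following is only a sketch of how such a %xx replacement might look. The names percentHex and decodePercent are hypothetical, and the sketch assumes exactly two hex digits per escape, which limits it to code points U+0000 through U+00FF.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// percentHex matches a percent sign followed by exactly two hex digits.
var percentHex = regexp.MustCompile(`%[0-9A-Fa-f]{2}`)

// decodePercent replaces each %XX with the rune whose code point is 0xXX.
// Hypothetical helper; this is NOT URL percent-decoding, which works on bytes.
func decodePercent(in string) string {
	return percentHex.ReplaceAllStringFunc(in, func(m string) string {
		n, err := strconv.ParseUint(m[1:], 16, 8)
		if err != nil {
			return m // should not happen for a match; keep the text unchanged
		}
		return string(rune(n))
	})
}

func main() {
	fmt.Println(decodePercent("ABC%FC%F6123")) // ABCüö123
}
```

Text that does not match the pattern, such as a lone % sign, is left as it is.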

Why does golang bytes.Buffer behave in such way?

I recently faced a problem where I'm writing to a bytes.Buffer using a writer. But when I call String() on that bytes.Buffer I get unexpected output (an extra pair of double quotes is added). Can you please help me understand it?
Here is a code snippet showing my problem. I just need help understanding why each word is surrounded by an extra double quote.
func main() {
var csvBuffer bytes.Buffer
wr := csv.NewWriter(&csvBuffer)
data := []string{`{"agent":"python-requests/2.19.1","api":"/packing-slip/7123"}`}
err := wr.Write(data)
if err != nil {
fmt.Println("WARNING: unable to write ", err)
}
wr.Flush()
fmt.Println(csvBuffer.String())
}
Output:
{""agent"":""python-requests/2.19.1"",""api"":""/packing-slip/7123""}
In CSV double quotes (") are escaped as 2 double quotes. That's what you see.
You encode a single string value which contains double quotes, so all those are replaced with 2 double quotes.
When decoded, the result will of course contain single double quotes again:
r := csv.NewReader(&csvBuffer)
rec, err := r.Read()
fmt.Println(rec, err)
Outputs (try it on the Go Playground):
[{"agent":"python-requests/2.19.1","api":"/packing-slip/7123"}] <nil>
Quoting from package doc of encoding/csv:
Within a quoted-field a quote character followed by a second quote character is considered a single quote.
"the ""word"" is true","a ""quoted-field"""
results in
{`the "word" is true`, `a "quoted-field"`}
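The doubling rule quoted above can be checked with a small round trip. This is a minimal, self-contained sketch; encodeRecord and decodeRecord are hypothetical helper names wrapped around csv.Writer and csv.Reader.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
)

// encodeRecord writes one record with encoding/csv and returns the raw CSV text.
func encodeRecord(fields []string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(fields); err != nil {
		return "", err
	}
	w.Flush()
	return buf.String(), w.Error()
}

// decodeRecord reads the first record back from CSV text.
func decodeRecord(text string) ([]string, error) {
	return csv.NewReader(strings.NewReader(text)).Read()
}

func main() {
	raw, err := encodeRecord([]string{`one"1`, `two"2`})
	if err != nil {
		panic(err)
	}
	fmt.Print(raw) // "one""1","two""2"

	rec, err := decodeRecord(raw)
	fmt.Println(rec, err) // [one"1 two"2] <nil>
}
```

The doubled quotes exist only in the encoded text; after decoding, each field is identical to what was written.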
In CSV, the following are equivalent:
one,two
and
"one","two"
If a quoted value contained a bare double quote, it would be read as the end of the value. CSV handles this by doubling every double quote inside the value. The value one"1 is encoded as one""1 in CSV, e.g.:
"one""1","two""2"

Golang: Issues replacing newlines in a string from a text file

I've been trying to read a file and put its contents into a string. The string will then be split by line into multiple strings:
absPath, _ := filepath.Abs("../Go/input.txt")
data, err := ioutil.ReadFile(absPath)
if err != nil {
panic(err)
}
input := string(data)
The input.txt is read as:
a
strong little bird
with a very
big heart
went
to school one day and
forgot his food at
home
However,
re = regexp.MustCompile("\\n")
input = re.ReplaceAllString(input, " ")
turns the text into a mangled mess of:
homeot his food atand
I'm not sure how replacing newlines can mess up so badly to the point where the text inverts itself
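I guess that you are running the code on Windows. Observe that if you print out the length of the resulting string, it will show something over 100 characters. The reason is that Windows uses not only newlines (\n) but also carriage returns (\r), so a line break in Windows is actually \r\n, not \n. After your replacement the \r characters are still in the string, and when it is printed to a terminal each \r moves the cursor back to the start of the line, so every segment overwrites the previous one. That is why the output looks scrambled. To properly filter them out of your string, use: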
re := regexp.MustCompile(`\r?\n`)
input = re.ReplaceAllString(input, " ")
The backticks will make sure that you don't need to quote the backslashes in the regular expression. I used the question mark for the carriage return to make sure that your code works on other platforms as well.
I do not think that you need to use regex for such an easy task. This can be achieved with just
absPath, _ := filepath.Abs("../Go/input.txt")
data, _ := ioutil.ReadFile(absPath)
input := string(data)
input = strings.Replace(input, "\n", "", -1) // Replace returns a new string, so the result must be assigned
example of removing \n
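Note that replacing only "\n" still leaves the "\r" characters behind on Windows-style input. A regex-free sketch that handles both line endings, assuming each line break should become a single space as in the question; lineBreaks and joinLines are hypothetical names:

```go
package main

import (
	"fmt"
	"strings"
)

// lineBreaks maps Windows (\r\n) and Unix (\n) line endings to a single space.
// "\r\n" is listed first so that it is replaced as one unit.
var lineBreaks = strings.NewReplacer("\r\n", " ", "\n", " ")

// joinLines returns s with every line break replaced by a space.
func joinLines(s string) string {
	return lineBreaks.Replace(s)
}

func main() {
	fmt.Println(joinLines("a\r\nstrong little bird\r\nwith a very"))
	// a strong little bird with a very
}
```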

How to get a single Unicode character from string

I wonder how I can get a Unicode character from a string. For example, if the string is "你好", how can I get the first character "你"?
Elsewhere I found one way:
var str = "你好"
runes := []rune(str)
fmt.Println(string(runes[0]))
It does work.
But I still have some questions:
Is there another way to do it?
Why in Go does str[0] not get a Unicode character from a string, but it gets byte data?
First, you may want to read https://blog.golang.org/strings
It will answer part of your questions.
A string in Go can contain arbitrary bytes. When you write str[i], the result is a byte, and the index is always a byte offset.
Most of the time, strings are encoded in UTF-8 though. You have multiple ways to deal with UTF-8 encoding in a string.
For instance, you can use the for...range statement to iterate on a string rune by rune.
var first rune
for _,c := range str {
first = c
break
}
// first now contains the first rune of the string
You can also leverage the unicode/utf8 package. For instance:
r, size := utf8.DecodeRuneInString(str)
// r contains the first rune of the string
// size is the size of the rune in bytes
If the string is encoded in UTF-8, there is no direct way to access the nth rune of the string, because the size of the runes (in bytes) is not constant. If you need this feature, you can easily write your own helper function to do it (with for...range, or with the unicode/utf8 package).
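Such a helper might look like the following. It is a minimal sketch built on for...range; runeAt is a hypothetical name, and the lookup is linear in the length of the string because the runes have to be decoded one by one.

```go
package main

import "fmt"

// runeAt returns the n-th rune (0-based) of s.
// ok is false when n is negative or s has fewer than n+1 runes.
// Hypothetical helper for illustration.
func runeAt(s string, n int) (r rune, ok bool) {
	if n < 0 {
		return 0, false
	}
	for _, c := range s { // range decodes one UTF-8 rune per iteration
		if n == 0 {
			return c, true
		}
		n--
	}
	return 0, false
}

func main() {
	r, ok := runeAt("你好", 1)
	fmt.Println(string(r), ok) // 好 true
}
```

If many lookups are needed on the same string, converting it to a []rune once is usually simpler than calling a helper like this repeatedly.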
You can use the utf8string package:
package main
import "golang.org/x/exp/utf8string"
func main() {
s := utf8string.NewString("ÄÅàâäåçèéêëìîïü")
// example 1
r := s.At(1)
println(r == 'Å')
// example 2
t := s.Slice(1, 3)
println(t == "Åà")
}
https://pkg.go.dev/golang.org/x/exp/utf8string
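You can do this:
package main
import "fmt"
func main() {
str := "cat"
var s rune
for i, c := range str {
// i is the byte offset of c; it equals the rune index only for ASCII text
if i == 2 {
s = c
}
}
fmt.Println(string(s)) // t
}
s is now equal to 't'.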

Indexing string as chars

The elements of strings have type byte and may be accessed using the
usual indexing operations.
How can I get an element of a string as a char?
"some"[1] -> "o"
The simplest solution is to convert it to a slice of runes:
var runes = []rune("someString")
Note that when you iterate over a string, you don't need the conversion. See this example from Effective Go:
for pos, char := range "日本語" {
fmt.Printf("character %c starts at byte position %d\n", char, pos)
}
This prints
character 日 starts at byte position 0
character 本 starts at byte position 3
character 語 starts at byte position 6
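Applied to the "some"[1] case from the question, the rune conversion can be wrapped as follows. This is only a sketch; charAt is a hypothetical name, and it panics when i is out of range, just like ordinary slice indexing.

```go
package main

import "fmt"

// charAt returns the i-th code point of s as a string.
// It panics if i is out of range. Hypothetical helper for illustration.
func charAt(s string, i int) string {
	return string([]rune(s)[i])
}

func main() {
	fmt.Println(charAt("some", 1))   // o
	fmt.Println(charAt("日本語", 1)) // 本
	fmt.Println("some"[1])           // 111, the byte value of 'o'
}
```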
Go strings are usually, but not necessarily, UTF-8 encoded. When they do hold Unicode text, the term "char[acter]" is pretty complex, and there is no general/unique bijection between runes (code points) and Unicode characters.
Anyway, one can easily work with code points (runes) in a slice and index into it after a conversion:
package main
import "fmt"
func main() {
utf8 := "Hello, 世界"
runes := []rune(utf8)
fmt.Printf("utf8:% 02x\nrunes: %#v\n", []byte(utf8), runes)
}
Also here: http://play.golang.org/p/qWVSA-n93o
Note: Often the desire to access Unicode "characters" by index is a design mistake. Most textual data is processed sequentially.
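To illustrate the point about characters versus code points made in this answer: the same visible character "é" can be stored as one code point or as two. A minimal sketch; counts is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the byte length and the rune (code point) count of s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	combined := "e\u0301"   // e followed by COMBINING ACUTE ACCENT

	fmt.Println(counts(precomposed)) // 2 1
	fmt.Println(counts(combined))    // 3 2
}
```

Indexing a []rune of the second string at 0 would give a plain "e" without its accent, so even rune indexing does not always correspond to what a reader sees as one character.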
Another option is the package utf8string:
package main
import "golang.org/x/exp/utf8string"
func main() {
s := utf8string.NewString("🧡💛💚💙💜")
t := s.At(2)
println(t == '💚')
}
https://pkg.go.dev/golang.org/x/exp/utf8string

Resources