Iterating over a Go string and making a string from chars in Go

I started learning Go and I want to implement an algorithm. I can iterate over a string and get its chars, but these chars are Unicode numbers.
How do I concatenate chars into a string in Go? Do you have a reference? I was unable to find anything about primitives on the official page.

Iterating over strings using range gives you Unicode characters while
iterating over a string using an index gives you bytes. See the spec for
runes and strings as well as their conversions.
As The New Idiot mentioned, strings can be concatenated using the +
operator.
The conversion from character to string is two-fold. You can convert
a byte (or byte sequence) to a string:
string(byte('A'))
or you can convert a rune (or rune sequence) to a string:
string(rune('µ'))
The difference is that runes represent Unicode characters while bytes represent
8 bit values.
But all of this is mentioned in the respective sections of the spec I linked above.
It's quite easy to understand, you should definitely read it.
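To make the byte/rune distinction concrete, here is a minimal sketch. The helper names byteValues and runeValues are made up for this illustration and are not part of any library:

```go
package main

import "fmt"

// byteValues (hypothetical helper) collects what indexing the string yields.
func byteValues(s string) []byte {
	out := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, s[i]) // s[i] is a single byte, not a character
	}
	return out
}

// runeValues (hypothetical helper) collects what ranging over the string yields.
func runeValues(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r) // r is a rune, i.e. a Unicode code point
	}
	return out
}

func main() {
	s := "aµ"
	fmt.Println(len(byteValues(s)), len(runeValues(s))) // 3 2
	fmt.Println(string(byte('A')))                      // A
	fmt.Println(string(rune('µ')))                      // µ
	fmt.Println(string(runeValues(s)))                  // aµ
}
```

'µ' takes two bytes in UTF-8, so the index loop sees three bytes while the range loop sees two runes.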

You can convert a []rune to a string directly:
string([]rune{'h', 'e', 'l', 'l', 'o', '☃'})
http://play.golang.org/p/P9vKXlo47c
As for a reference, it's in the Conversions section of the Go spec, under "Conversions to and from a string type":
http://golang.org/ref/spec#Conversions
As for concatenation, you probably don't want to concatenate every single character with the + operator, since that performs a lot of copying under the hood. If you're getting runes one at a time and you're not building an intermediate slice of runes, you most likely want to use a bytes.Buffer, which has a WriteRune method for this sort of thing. http://golang.org/pkg/bytes/#Buffer.WriteRune
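A minimal sketch of the bytes.Buffer approach, assuming the runes arrive one at a time. The helper name joinRunes is hypothetical; the input slice only stands in for whatever produces the runes:

```go
package main

import (
	"bytes"
	"fmt"
)

// joinRunes (hypothetical helper) appends runes one by one to a buffer
// instead of concatenating strings with +.
func joinRunes(rs []rune) string {
	var buf bytes.Buffer
	for _, r := range rs {
		buf.WriteRune(r) // writes the UTF-8 encoding of r
	}
	return buf.String()
}

func main() {
	fmt.Println(joinRunes([]rune{'h', 'e', 'l', 'l', 'o', '☃'})) // hello☃
}
```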

Use +:
str = str + "a"
You can try something like this:
string1 := "abc"
character1 := byte('A')
string1 += string(character1)
Even this answer might be of help.

Definitely worth reading @nemo's post:
Iterating over strings using range gives you Unicode characters while iterating over a string using an index gives you bytes. See the spec for runes and strings as well as their conversions.
Strings can be concatenated using the + operator.
The conversion from character to string is two-fold. You can convert a byte (or byte sequence) to a string:
string(byte('A'))
or you can convert a rune (or rune sequence) to a string:
string(rune('µ'))

Related

Golang: what is the optimal way to append and remove a character?

In Go, we have strings.Builder to append characters, which is better than using s = s + string(character). But is there an optimal way to remove the last character, other than s = s[:len(s)-sizeOfLastCharacter]?
Slicing is a very efficient operation, as detailed in slice internals:
Slicing does not copy the slice's data. It creates a new slice value that points to the original array. This makes slice operations as efficient as manipulating array indices.
Effectively, removing the last element of a slice means creating a new slice descriptor pointing to the same array but with a smaller length. Short of direct access to the internals of a slice, you won't find a more efficient solution.
Use utf8.DecodeLastRuneInString() to find out how many bytes the last rune "occupies", and slice the original string based on that. Slicing a string results in a string value that shares the backing array with the original, so the string content is not copied, just a new string header is created which is just 2 integer values (see reflect.StringHeader).
For example:
s := "Hello, 世界"
r, size := utf8.DecodeLastRuneInString(s)
if r != utf8.RuneError {
    s = s[:len(s)-size]
}
fmt.Println(s)
Outputs (try it on the Go Playground):
Hello, 世
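Putting both halves of the question together, here is a rough sketch that appends with strings.Builder and then drops the last rune by slicing. The helper dropLastRune is hypothetical, and it is a variant of the snippet above: it does not check for utf8.RuneError, so an empty string is returned unchanged and a trailing invalid byte is dropped as a single byte.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// dropLastRune (hypothetical helper) slices off the last rune of s.
// The returned string is a slice of s, so no string content is copied.
func dropLastRune(s string) string {
	_, size := utf8.DecodeLastRuneInString(s) // size is 0 for an empty string
	return s[:len(s)-size]
}

func main() {
	var sb strings.Builder
	for _, r := range []rune{'世', '界'} {
		sb.WriteRune(r) // append one rune at a time
	}
	s := "Hello, " + sb.String()
	fmt.Println(dropLastRune(s)) // Hello, 世
}
```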

What determines the position of a character when looping through UTF-8 strings?

I am reading the section on for statements in the Effective Go documentation and came across this example:
for pos, char := range "日本\x80語" {
    fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
The output is:
Character U+65E5 '日', at position: 0
Character U+672C '本', at position: 3
Character U+FFFD '�', at position: 6
Character U+8A9E '語', at position: 7
What I don't understand is why the positions are 0, 3, 6, and 7. This tells me the first and second character is 3 bytes long and the 'replacement rune' (U+FFFD) is 1 byte long, which I accept and understand. However, I thought rune was of int32 type and therefore would be 4 bytes each, not three.
Why are the positions in a range different to the total amount of memory each value should be consuming?
string values in Go are stored as read only byte slices ([]byte), where the bytes are the UTF-8 encoded bytes of the (runes of the) string. UTF-8 is a variable-length encoding, different Unicode code points may be encoded using different number of bytes. For example values in the range 0..127 are encoded as a single byte (whose value is the unicode codepoint itself), but values greater than 127 use more than 1 byte. The unicode/utf8 package contains UTF-8 related utility functions and constants, for example utf8.UTFMax reports the maximum number of bytes a valid Unicode codepoint may "occupy" in UTF-8 encoding (which is 4).
One thing to note here: not all possible byte sequences are valid UTF-8 sequences. A string may be any byte sequence, even those that are invalid UTF-8 sequences. For example the string value "\xff" represents an invalid UTF-8 byte sequence, for details, see How do I represent an Optional String in Go?
The for range construct –when applied on a string value– iterates over the runes of the string:
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
The for range construct may produce 1 or 2 iteration values. When using 2, like in your example:
for pos, char := range "日本\x80語" {
    fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
For each iteration, pos will be the byte index of the rune / character, and char will be the rune itself. As the quote above says, when an invalid UTF-8 sequence is encountered, char will be 0xFFFD (the Unicode replacement character), and the for range construct (the iteration) will advance a single byte only.
To sum it up: the position is always the byte index of the rune of the current iteration (or more specifically: the byte index of the first byte of the UTF-8 encoded sequence of that rune), but if an invalid UTF-8 sequence is encountered, the position (index) will only be incremented by 1 in the next iteration.
A must-read blog post if you want to know more about the topic:
The Go Blog: Strings, bytes, runes and characters in Go
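The positions in the question can be reproduced and cross-checked with the unicode/utf8 package. This is only a sketch; the helper name runeStarts is made up for the example:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStarts (hypothetical helper) returns the byte index at which each
// rune reported by range begins.
func runeStarts(s string) []int {
	var starts []int
	for pos := range s {
		starts = append(starts, pos)
	}
	return starts
}

func main() {
	s := "日本\x80語"
	fmt.Println(runeStarts(s))             // [0 3 6 7]
	fmt.Println(len(s))                    // 10 (bytes: 3 + 3 + 1 + 3)
	fmt.Println(utf8.RuneCountInString(s)) // 4
	fmt.Println(utf8.RuneLen('日'))         // 3
	fmt.Println(utf8.ValidString(s))       // false, because of the \x80 byte
}
```

The rune value handed to the loop body is a 4-byte int32, but the positions count bytes of the UTF-8 encoding stored in the string, which is why they advance by 3, 3 and 1.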
A rune is a code point, and a code point is just an integer. You could even use an int64 to store it if you wanted to. (But Unicode only has 1,114,112 code points, so int32 is the right choice. No wonder rune is an alias of int32 in Go.)
Different encoding schemes encode code points in different ways. E.g. a CJK character is usually encoded as 3 bytes in UTF-8, and as 2 bytes in UTF-16.
Go source code is UTF-8, so a string literal holds UTF-8 bytes unless it contains byte-level escapes such as \x80.
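As a rough illustration of the encoding point, the sketch below compares the encoded size of a few code points in UTF-8 and UTF-16. The helper name encodedSizes is hypothetical:

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodedSizes (hypothetical helper) reports how many bytes r needs in
// UTF-8 and in UTF-16.
func encodedSizes(r rune) (utf8Bytes, utf16Bytes int) {
	utf8Bytes = utf8.RuneLen(r)
	utf16Bytes = 2 * len(utf16.Encode([]rune{r})) // each UTF-16 code unit is 2 bytes
	return utf8Bytes, utf16Bytes
}

func main() {
	for _, r := range []rune{'a', '語', '\U0001F600'} {
		u8, u16 := encodedSizes(r)
		fmt.Printf("%U: UTF-8 %d bytes, UTF-16 %d bytes\n", r, u8, u16)
	}
}
```

The same code point always fits in one int32 rune, while its encoded length depends on the encoding and on the code point itself.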

How to convert a string like "\u****" to text?

I want to convert a string like "\u****" to text (Unicode) in Haskell.
I have a Java properties file, and it has the following content:
i18n.test.key=\u0050\u0069\u006e\u0067\u0020\uc190\uc2e4\ub960\u0020\ud50c\ub7ec\uadf8\uc778
I want to convert it to text (Unicode) in Haskell.
I think I can do it like this:
1. Convert "\u****" to a Word8 array
2. Convert the Word8 array to a ByteString
3. Use Text.Encoding.decodeUtf8 to convert the ByteString to Text
But step 1 is a little complicated for me.
How can I do this in Haskell?
A simple solution may look like this:
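import qualified Data.ByteString as BS
import qualified Data.Text.Encoding as T

-- Treat every \uXXXX escape as one big-endian UTF-16 code unit.
decodeJava = T.decodeUtf16BE . BS.concat . gobble
gobble [] = []
gobble ('\\':'u':a:b:c:d:rest) = let sym = convert16 [a,b] [c,d]
                                 in sym : gobble rest
gobble _ = error "decoding error"
-- Build a two-byte ByteString from the high and low pairs of hex digits.
convert16 hi lo = BS.pack [read $ "0x"++hi, read $ "0x"++lo]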
Notes:
Your string is UTF16-encoded, therefore you need decodeUtf16BE.
Decoding will fail if there are other characters in the string. This code will work with your example only if you remove the trailing i.
Constructing the words by appending 0x and, in particular, using read is very slow, but will do the trick for small data.
If you replace \u with \x then this is a valid Haskell string literal.
my_string = "\x0050\x0069\x006e..."
You can then convert to Text if you want, or leave it as String, or whatever.
Watch out, Java normally uses UTF-16 to encode its strings, so interpreting the bytes as UTF-8 will probably not work.
If the codes in your file are UTF-16, you need to do the following:
find the numeric value (UTF-16 code unit) for each quadruple of hex digits
check if this is a high surrogate character. If so, the following character will be a low surrogate character. The pair of surrogate characters can be mapped to a Unicode code point.
make a String from your list of Unicode numbers with map toEnum
The following is a quote from the Java doc http://docs.oracle.com/javase/7/docs/api/ :
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Java has methods to combine a high surrogate character and a low surrogate character to get the Unicode point. You may want to check the source of the java.lang.Character class to find out how exactly they do this, but I guess it is some simple bit-operation.
Another possibility would be to check for a Haskell library that does UTF-16 decoding.
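For what it's worth, the surrogate handling described above is small enough to sketch. The version below is written in Go rather than Haskell and leans on the standard unicode/utf16 package to combine surrogate pairs; the helper name decodeEscapes is hypothetical, and it assumes the input consists of nothing but \uXXXX escapes:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf16"
)

// decodeEscapes (hypothetical helper) turns a string made only of \uXXXX
// escapes into text. Any other input is reported as an error.
func decodeEscapes(s string) (string, error) {
	var units []uint16
	for len(s) > 0 {
		if len(s) < 6 || !strings.HasPrefix(s, `\u`) {
			return "", fmt.Errorf("unexpected input at %q", s)
		}
		v, err := strconv.ParseUint(s[2:6], 16, 16) // four hex digits = one UTF-16 code unit
		if err != nil {
			return "", err
		}
		units = append(units, uint16(v))
		s = s[6:]
	}
	// utf16.Decode merges each high/low surrogate pair into one code point.
	return string(utf16.Decode(units)), nil
}

func main() {
	out, err := decodeEscapes(`\u0050\u0069\u006e\u0067\u0020\uc190\uc2e4\ub960`)
	fmt.Println(out, err)
}
```

Under the hood a surrogate pair maps to 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00), which is the "simple bit operation" guessed at above.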

Convert a char to upper case

I have a variable which contains a single char. I want to convert this char to upper case. However, the to_uppercase function returns a rustc_unicode::char::ToUppercase struct instead of a char.
Explanation
ToUppercase is an Iterator, that may yield more than one char. This is necessary, because some Unicode characters consist of multiple "Unicode Scalar Values" (which a Rust char represents).
A nice example is the so-called ligatures. Try this for example (on playground):
let fi_upper: Vec<_> = 'ﬁ'.to_uppercase().collect();
println!("{:?}", fi_upper); // prints: ['F', 'I']
The 'ﬁ' ligature (U+FB01) is a single character whose uppercase version consists of two letters/characters.
Solution
There are multiple possibilities how to deal with that:
Work on &str: if your data is actually in string form, use str::to_uppercase which returns a String which is easier to work with.
Use ASCII methods: if you are sure that your data is ASCII only and/or you don't care about Unicode symbols, you can use to_ascii_uppercase (originally provided by the std::ascii::AsciiExt trait, now an inherent method on char), which returns just a char. But it only changes the letters 'a' to 'z' and ignores all other characters!
Deal with it manually: Collect into a String or Vec like in the example above.
ToUppercase is an iterator because the uppercase version of the character may be composed of several code points, as delnan pointed out in the comments. You can convert that to a vector of characters:
c.to_uppercase().collect::<Vec<_>>();
Then you should collect those characters into a string, as ker pointed out.

What is the difference between the string and []byte in Go?

s := "some string"
b := []byte(s) // convert string -> []byte
s2 := string(b) // convert []byte -> string
what is the difference between the string and []byte in Go?
When to use "he" or "she"?
Why?
bb := []byte{'h','e','l','l','o',127}
ss := string(bb)
fmt.Println(ss)
hello
The output is just "hello", and the 127 seems to be missing. Sometimes that feels weird to me.
string and []byte are different types, but they can be converted to one another:
3. Converting a slice of bytes to a string type yields a string whose successive bytes are the elements of the slice.
4. Converting a value of a string type to a slice of bytes type yields a slice whose successive elements are the bytes of the string.
Blog: Arrays, slices (and strings): The mechanics of 'append':
Strings are actually very simple: they are just read-only slices of bytes with a bit of extra syntactic support from the language.
Also read: Strings, bytes, runes and characters in Go
When to use one over the other?
Depends on what you need. Strings are immutable, so they can be shared and you have guarantee they won't get modified.
Byte slices can be modified (meaning the content of the backing array).
Also if you need to frequently convert a string to a []byte (e.g. because you need to write it into an io.Writer), you should consider storing it as a []byte in the first place.
Also note that you can have string constants but there are no slice constants. This may be a small optimization. Also note that:
The expression len(s) is constant if s is a string constant.
Also if you are using code already written (either standard library, 3rd party packages or your own), in most of the cases it is given what parameters and values you have to pass or are returned. E.g. if you read data from an io.Reader, you need to have a []byte which you have to pass to receive the read bytes, you can't use a string for that.
This example:
bb := []byte{'h','e','l','l','o',127}
What happens here is that you used a composite literal (slice literal) to create and initialize a new slice of type []byte (using a short variable declaration). You specified constants to list the initial elements of the slice. You also used the byte value 127 (the ASCII DEL control character) which, depending on the platform / console, may or may not have a visual representation. The byte is still part of the string; it just may not be displayed.
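A quick way to convince yourself that the byte is still there is to print the length and a quoted form of the string. This is just a sketch; the helper name quoted is made up:

```go
package main

import "fmt"

// quoted (hypothetical helper) renders the bytes as a Go string literal,
// escaping anything that is not printable.
func quoted(b []byte) string {
	return fmt.Sprintf("%q", string(b))
}

func main() {
	bb := []byte{'h', 'e', 'l', 'l', 'o', 127}
	ss := string(bb)
	fmt.Println(len(ss))      // 6
	fmt.Println(quoted(bb))   // "hello\x7f"
	fmt.Println(ss[5] == 127) // true
}
```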
Late, but I hope this helps.
In simple words:
Bit: a 0 or a 1, which is how machines represent all information
Byte: 8 bits. In UTF-8, a character is encoded as one to four bytes
[]type: slice of a given data type. Slices are dynamically sized views onto arrays.
[]byte: this is a byte slice, i.e. a dynamically sized array of bytes. Each element is one byte, which for ASCII text is also one character.
String: read-only slice of bytes, i.e. immutable
With all this in mind:
s := "Go"
bs := []byte(s)
fmt.Printf("%s", bs) // Output: Go
fmt.Printf("%d", bs) // Output: [71 111]
or
bs := []byte{71, 111}
fmt.Printf("%s", bs) // Output: Go
%s prints the byte slice as a string
%d prints the decimal value of each byte
IMPORTANT:
As strings are immutable, they cannot be changed in memory: each time you add or remove something from a string, Go creates a new string. Byte slices, on the other hand, are mutable, so updating an element of a byte slice changes it in place without allocating anything new.
So choosing the right structure can make a difference in your app's performance.
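Here is a small sketch of that difference. The helper name mutateCopy is hypothetical; it only exists to show that converting between string and []byte copies the data, so changing the slice never changes a string made from it earlier:

```go
package main

import "fmt"

// mutateCopy (hypothetical helper) converts s to a byte slice, takes a
// string snapshot, then overwrites the first byte of the slice in place.
func mutateCopy(s string) (snapshot, changed string) {
	b := []byte(s)       // copies the bytes of s into a mutable slice
	snapshot = string(b) // copies again into an immutable string
	if len(b) > 0 {
		b[0] = 'j' // in-place update of the slice only
	}
	return snapshot, string(b)
}

func main() {
	before, after := mutateCopy("hello")
	fmt.Println(before, after) // hello jello
}
```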

Resources