I'm curious as why Go doesn't provide a []byte(*string) method. From a performance perspective, wouldn't []byte(string) make a copy of the input argument and add more cost (though this seems odd since strings are immutable, why copy them)?
[]byte("something") is not a function (or method) call, it's a type conversion.
The type conversion "itself" does not copy the value. Converting a string to a []byte however does, and it needs to, because the result byte slice is mutable, and if a copy would not be made, you could modify / alter the string value (the content of the string) which is immutable, it must be as the Spec: String types section dictates:
Strings are immutable: once created, it is impossible to change the contents of a string.
Note that there are few cases when string <=> []byte conversion does not make a copy as it is optimized "away" by the compiler. These are rare and "hard coded" cases when there is proof an immutable string cannot / will not end up modified.
Such an example is looking up a value from a map where the key type is string, and you index the map with a []byte, converted to string of course (source):
key := []byte("some key")
var m map[string]T
// ...
v, ok := m[string(key)] // Copying key here is optimized away
Another optimization is when ranging over the bytes of a string that is explicitly converted to a byte slice:
s := "something"
for i, v := range []byte(s) { // Copying s is optimized away
// ...
}
(Note that without the conversion the for range would iterate over the runes of the string and not over its UTF8-encoded bytes.)
I'm curious as why Golang doesn't provide a []byte(*string) method.
Because it doesn't make sense.
A pointer (to any type) cannot be represented (in any obviously meaningful way) as a []byte.
From a performance perspective, wouldn't []byte(string) make a copy of the input argument and add more cost (though this seems odd since strings are immutable, why copy them)?
Converting from []byte to string (and vice versa) does involve a copy, because strings are immutable, but byte arrays are not.
However, using a pointer wouldn't solve that problem.
Related
I am trying to efficiently count runes from a utf-8 string using the utf8 library. Is this example optimal in that it does not copy the underlying data?
https://golang.org/pkg/unicode/utf8/#example_DecodeRuneInString
func main() {
str := "Hello, 世界" // let's assume a runtime-provided string
for len(str) > 0 {
r, size := utf8.DecodeRuneInString(str)
fmt.Printf("%c %v\n", r, size)
str = str[size:] // performs copy?
}
}
I found StringHeader in the (unsafe) reflect library. Is this the exact structure of a string in Go? If so, it is conceivable that slicing a string merely updates Data or allocates a new StringHeader altogether.
type StringHeader struct {
Data uintptr
Len int
}
Bonus: where can I find the code that performs string slicing so that I could look it up myself? Any of these?
https://golang.org/src/runtime/slice.go
https://golang.org/src/runtime/string.go
This related SO answer suggests that runtime-strings incur a copy when converted from string to []byte.
Slicing Strings
does slice of string perform copy of underlying data?
No it does not. See this post by Russ Cox:
A string is represented in memory as a 2-word structure containing a pointer to the string data and a length. Because the string is immutable, it is safe for multiple strings to share the same storage, so slicing s results in a new 2-word structure with a potentially different pointer and length that still refers to the same byte sequence. This means that slicing can be done without allocation or copying, making string slices as efficient as passing around explicit indexes.
-- Go Data Structures
Slices, Performance, and Iterating Over Runes
A slice is basically three things: a length, a capacity, and a pointer to a location in an underlying array.
As such, slices themselves are not very large: ints and a pointer (possibly some other small things in implementation detail). So the allocation required to make a copy of a slice is very small, and doesn't depend on the size of the underlying array. And no new allocation is required when you simply update the length, capacity, and pointer location, such as on line 2 of:
foo := []int{3, 4, 5, 6}
foo = foo[1:]
Rather, it's when a new underlying array has to be allocated that a performance impact is felt.
Strings in Go are immutable. So to change a string you need to make a new string. However, strings are closely related to byte slices, e.g. you can create a byte slice from a string with
foo := `here's my string`
fooBytes := []byte(foo)
I believe that will allocate a new array of bytes, because:
a string is in effect a read-only slice of bytes
according to the Go Blog (see Strings, bytes, runes and characters in Go). In general you can use a slice to change the contents of an underlying array, so to produce a usable byte slice from a string you would have to make a copy to keep the user from changing what's supposed to be immutable.
You could use performance profiling and benchmarking to gain further insight into the performance of your program.
Once you have your slice of bytes, fooBytes, reslicing it does not allocate a new array, it just allocates a new slice, which is small. This appears to be what slicing a string does as well.
Note that you don't need to use the utf8 package to count words in a utf8 string, though you may proceed that way if you like. Go handles utf8 natively. However if you want to iterate over characters you can't represent the string as a slice of bytes, because you could have multibyte characters. Instead you need to represent it as a slice of runes:
foo := `here's my string`
fooRunes := []rune(foo)
This operation of converting a string to a slice of runes is fast in my experience (trivial in benchmarks I've done, but there may be an allocation). Now you can iterate across fooRunes to count words, no utf8 package required. Alternatively, you can skip the explicit []rune(foo) conversion and do it implicitly by using a for ... range loop on the string, because those are special:
A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value.
-- Strings, bytes, runes and characters in Go
s := "some string"
b := []byte(s) // convert string -> []byte
s2 := string(b) // convert []byte -> string
what is the difference between the string and []byte in Go?
When to use "he" or "she"?
Why?
bb := []byte{'h','e','l','l','o',127}
ss := string(bb)
fmt.Println(ss)
hello
The output is just "hello", and lack of 127, sometimes I feel that it's weird.
string and []byte are different types, but they can be converted to one another:
3 . Converting a slice of bytes to a string type yields a string whose successive bytes are the elements of the slice.
4 . Converting a value of a string type to a slice of bytes type yields a slice whose successive elements are the bytes of the string.
Blog: Arrays, slices (and strings): The mechanics of 'append':
Strings are actually very simple: they are just read-only slices of bytes with a bit of extra syntactic support from the language.
Also read: Strings, bytes, runes and characters in Go
When to use one over the other?
Depends on what you need. Strings are immutable, so they can be shared and you have guarantee they won't get modified.
Byte slices can be modified (meaning the content of the backing array).
Also if you need to frequently convert a string to a []byte (e.g. because you need to write it into an io.Writer()), you should consider storing it as a []byte in the first place.
Also note that you can have string constants but there are no slice constants. This may be a small optimization. Also note that:
The expression len(s) is constant if s is a string constant.
Also if you are using code already written (either standard library, 3rd party packages or your own), in most of the cases it is given what parameters and values you have to pass or are returned. E.g. if you read data from an io.Reader, you need to have a []byte which you have to pass to receive the read bytes, you can't use a string for that.
This example:
bb := []byte{'h','e','l','l','o',127}
What happens here is that you used a composite literal (slice literal) to create and initialize a new slice of type []byte (using Short variable declaration). You specified constants to list the initial elements of the slice. You also used a byte value 127 which - depending on the platform / console - may or may not have a visual representation.
Late but i hope this could help.
In simple words
Bit: 0 and 1 is how machines represents all the information
Byte: 8 bits that represents UTF-8 encodings i.e. characters
[ ]type: slice of a given data type. Slices are dynamic size arrays.
[ ]byte: this is a byte slice i.e. a dynamic size array that contains bytes i.e. each element is a UTF-8 character.
String: read-only slices of bytes i.e. immutable
With all this in mind:
s := "Go"
bs := []byte(s)
fmt.Printf("%s", bs) // Output: Go
fmt.Printf("%d", bs) // Output: [71 111]
or
bs := []byte{71, 111}
fmt.Printf("%s", bs) // Output: Go
%s converts byte slice to string
%d gets UTF-8 decimal value of bytes
IMPORTANT:
As strings are immutable, they cannot be changed within memory, each time you add or remove something from a string, GO creates a new string in memory. On the other hand, byte slices are mutable so when you update a byte slice you are not recreating new stuffs in memory.
So choosing the right structure could make a difference in your app performance.
I am getting input from the user, however when I try to compare it later on to a string literal it does not work. That is just a test though.
I would like to set it up so that when a blank line is entered (just hitting the enter/return key) the program exits. I don't understand why the strings are not comparing because when I print it, it comes out identical.
in := bufio.NewReader(os.Stdin);
input, err := in.ReadBytes('\n');
if err != nil {
fmt.Println("Error: ", err)
}
if string(input) == "example" {
os.Exit(0)
}
string vs []byte
string definition:
string is the set of all strings of 8-bit bytes, conventionally but not necessarily representing UTF-8-encoded text. A string may be empty, but not nil. Values of string type are immutable.
byte definition:
byte is an alias for uint8 and is equivalent to uint8 in all ways. It is used, by convention, to distinguish byte values from 8-bit unsigned integer values.
What does it mean?
[]byte is a byte slice. slice can be empty.
string elements are unicode characters, which can have more then 1 byte.
string elements keep a meaning of data (encoding), []bytes not.
equality operator is defined for string type but not for slice type.
As you see they are two different types with different properties.
There is a great blog post explaining different string related types [1]
Regards the issue you have in your code snippet.
Bear in mind that in.ReadBytes(char) returns a byte slice with char inclusively. So in your code input ends with '\n'. If you want your code to work in desired way then try this:
if string(input) == "example\n" { // or "example\r\n" when on windows
os.Exit(0)
}
Also make sure that your terminal code page is the same as your .go source file. Be aware about different end-line styles (Windows uses "\r\n"), Standard go compiler uses utf8 internally.
[1] Comparison of Go data types for string processing.
I am trying to finish up a homework assignment, and am down to the last part. First, I'll show you the type that I am dealing with:
TYPE Book_Collection IS
RECORD
Books : Book_Collection_Array;
Max_Size : Integer;
Size : Integer;
END RECORD;
TYPE Book_Type IS
RECORD
Title,
Author,
Publisher : Title_Str;
Year : Year_Type;
Edition : Natural;
Isbn : Isbn_Type;
Price : Dollars;
Stock : Natural;
Format : Format_Type;
END RECORD;
Book_Collection_Array is an array of book_type. These are private types, so the array is bounded (1..200).
There is a function called ToString in a separate package that was provided to us, that takes a book_type as input, and returns a string of all the elements of book_type. What I need to create is a function that takes book_collection is a parameter, and returns a string concatenating all of the strings that are returned by the ToString function that was provided, for the book_types that exist in that book_collection. I have made multiple attempts, but am constantly getting range check failures. Can anyone point me in the right direction?
*Edit:
Thank you to both of you for your help. I went the route of using an unbounded string, and appending each string to it, then declaring an output string and setting it as a constant string equal to the the To_String of the unbounded_string.*
I'll give you a hint.
Ada strings ideally aren't treated or handled much like C or Java strings at all. C strings count on a trailing nul (0) character to designate the end of data in a buffer. Java strings keep track of their own length, and will dynamically reallocate themselves to keep to the proper length if need be. So typical string-handling idioms in those languages think nothing of progressively modifying a string variable.
Ada strings instead are expected to be perfectly-sized when created. Most routines will assume that every element in a string array contains valid character data, and any destination string you assign data into will be perfectly sized to hold it. If that isn't the case, usually an exception is raised (and most likely your program crashes).
There are several ways to deal with this when you are building a string. One way is to create a really big string object as a buffer, and keep a separate length variable to tell your code how much data is really in there at all times. Then when you call Ada string routines you can feed them just the slice of data from the string that is valid. eg: Put_line (My_New_String(1..My_String_Length));
A better way is to just deal with perfectly-sized constant strings. For example, if you want to tack String1 and String2 together, the safe Ada way to do this is:
My_New_String : constant String := String1 & String2;
Then if you later want a string that is this string with String3 tacked on:
My_New_New_String : constant String := My_New_String & String3;
For more info on this, I suggest you look at some of the links over on the right side of this browser window under the heading "Related". I see a lot of good stuff in there.
Ok, i've always kind of known that computers treat strings as a series of numbers under the covers, but i never really looked at the details of how it works. What sort of magic is going on in the average compiler/processor when we do, for instance, the following?
string myString = "foo";
myString += "bar";
print(myString) //replace with printing function of your choice
The answer is completely dependent on the language in question. But C is usually a good language to kind of see how things happen behind the scenes.
In C:
In C strings are array of char with a 0 at the end:
char str[1024];
strcpy(str, "hello ");
strcpy(str, "world!");
Behind the scenes str[0] == 'h' (which has an int value), str[1] == 'e', ...
str[11] == '!', str[12] == '\0';
A char is simply a number which can contain one of 256 values. Each character has a numeric value.
In C++:
strings are supported in the same way as C but you also have a string type which is part of STL.
string literals are part of static storage and cannot be changed directly unless you want undefined behavior.
It's implementation dependent how the string type actually works behind the scenes, but the string objects themselves are mutable.
In C#:
strings are immutable. Which means you can't directly change a string once it's created. When you do += what happen is a new string gets created and your string now references that new string.
The implementation varies between language and compiler of course, but typically for C it's something like the following. Note that strings are essentially syntactical sugar for char arrays (char[]) in C.
1.
string myString = "foo";
Allocate 3 bytes of memory for the array and set the value of the 1st byte to 'f' (its ASCII code rather), the 2nd byte to 'o', the 2rd byte to 'o'.
2.
foo += "bar";
Read existing string (char array) from memory pointed to by foo.
Allocate 6 bytes of memory, fill the first 3 bytes with the read contents of foo, and the next 3 bytes with b, a, and r.
3.
print(foo)
Read the string foo now points to from memory, and print it to the screen.
This is a pretty rough overview, but hopefully should give you the general idea.
Side note: In some languages/compuilers, char != byte - for example, C#, where strings are stored in Unicode format by default, and notably the length of the string is also stored in memory. C++ typically uses null-terminated strings, which solves the problem in another way, though it means determining its length is O(n) rather than O(1).
Its very language dependent. However, in most cases strings are immutable, so doing that is going to allocate a new string and release the old one's memory.
I'm assuming a typo in your sample and that there is only one variable called either foo or myString, not two variables?
I'd say that it'll depend a lot on what compiler you're using. In .Net strings are immutable so when you add "bar" you're not actually adding it but rather creating a new string containing "foobar" and telling it to put that in your variable.
In other languages it will work differently.