Strings, integers, data types - string

After several years of writing code for my own use, I'm trying to understand what it all really means.
a = "Foo"
b = ""
c = 5
d = True
a - string variable. "Foo" (with quotes) - string literal, i.e. an entity of the string data type.
b - string variable. "" - empty string.
c - integer variable. 5 - integer literal, i.e. an entity of the integral data type.
d - Boolean variable. True - Boolean value, i.e. an entity of the Boolean data type.
Questions:
Is my understanding correct?
It seems that 5 is an integer literal, which is an entity of the integral data type. "Integer" and "integral": why do we use different words here?
What are "string" and "integer"?
As I understand from Wikipedia, "string" and "integer" are not the same thing as string/integer literals or data types. In other words, there are 3 pairs of terms:
string literal, integer literal
string data type, integer data type
string, integer

Firstly, a literal value is any value which appears literally in code, e.g. "hello" is a string literal, 123 is an integer literal, etc. In contrast, for example:
int a = 5;
int b = 2;
int c = a + b;
a and b have literal values assigned to them, but c does not; it has a computed value assigned to it.
With any literal value, we describe the literal value by its data type (as in the first sentence), e.g. "string literal" or "integer literal".
Now a data type refers to how the computer, or the software running on the computer, interprets the binary value of some data. For most kinds of data, the interpretation of the bytes is typically defined in a standard. UTF-8, for example, is one way to interpret the bytes of a string's internal (binary) value. Interestingly, the actual bytes of a string are treated as unsigned, 8-bit integers. In UTF-8, the values of those integers are combined in various ways to determine which glyph, or character, should appear on the screen when those values are encountered in the data. UTF-8 is a variable-byte-length encoding which can use between 1 and 4 bytes per character (8 to 32 bits).
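As a small illustration (a sketch in Go, since other questions on this page use Go; any language with byte access would show the same thing), different characters take a different number of bytes under UTF-8:
package main

import "fmt"

func main() {
    // "A" fits in 1 byte, "é" takes 2 bytes and "€" takes 3 bytes in UTF-8.
    for _, s := range []string{"A", "é", "€"} {
        fmt.Printf("%q -> % x (%d byte(s))\n", s, []byte(s), len(s))
    }
}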
For numbers, particularly integers, implementations can vary, but a common representation uses four bytes with the most significant byte first; for signed integers the first bit of the first byte is the sign bit, while for unsigned integers it is simply the most significant bit. Storing the most significant byte first is referred to as big-endian ordering of the bytes in a multi-byte integer; there is also little-endian ordering. Integers can in principle use any number of bytes, but the most typically implemented sizes are 1, 2, 4 and sometimes 8 bytes, which bit-wise gives you 8, 16, 32 or 64 bits, respectively. Integer sizes other than these typically require a custom implementation.
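A small sketch of those two byte orders, using Go's encoding/binary package to store the same 32-bit value both ways:
package main

import (
    "encoding/binary"
    "fmt"
)

func main() {
    var buf [4]byte
    // 1234 is 0x000004D2 as a 4-byte unsigned integer.
    binary.BigEndian.PutUint32(buf[:], 1234)
    fmt.Printf("big-endian:    % x\n", buf[:]) // 00 00 04 d2
    binary.LittleEndian.PutUint32(buf[:], 1234)
    fmt.Printf("little-endian: % x\n", buf[:]) // d2 04 00 00
}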
For floating point numbers it gets a bit more tricky. There is a common standard for floating point numbers called IEEE-754 which describes how floats are encoded. Likewise for floats there are different sizes and variations, but primarily 16-, 32- and 64-bit floats are used, and sometimes 24-bit floats in some mobile device graphics implementations. There are also extended precision floats which use 40 or 80 bits.
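A sketch of the IEEE-754 encoding, using Go's math package to expose the raw bits of a 64-bit float (1 sign bit, 11 exponent bits, 52 fraction bits):
package main

import (
    "fmt"
    "math"
)

func main() {
    // 1.5 is encoded as sign 0, biased exponent 0x3FF, fraction 0x8000000000000.
    bits := math.Float64bits(1.5)
    fmt.Printf("%#x\n", bits)   // 0x3ff8000000000000
    fmt.Printf("%064b\n", bits) // the full 64-bit pattern
}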

Related

Binary Formatting Variables in TCL

I am trying to create a binary message to send over a socket, but I'm having trouble with the way TCL treats all variables as strings. I need to calculate the length of a string and know its value in binary.
set length [string length $message]
set binaryMessagePart [binary format s* { $length 0 }]
However, when I run this I get the error 'expected integer but got "$length"'. How do I get this to work and return the value for the integer 5 and not the char 5?
To calculate the length of a string, use string length. To calculate the length of a string in a particular encoding, convert the string to that encoding and use string length:
set enc "utf-8"; # Or whatever; you need to know this ahead of time for sanity's sake
set encoded [encoding convertto $enc $message]
set length [string length $encoded]
Note that the encoded length will be in bytes, whereas the length prior to encoding is in characters. For some messages and some encodings, the difference can be substantial.
To compose a binary message with the length and the body of the message (a fairly common binary format), use binary format like this:
# Assumes the length is big-endian; for little-endian, use i instead of I
set binPart [binary format "Ia*" $length $encoded]
What you were doing wrong was using s*, which consumes a list of integers and produces a sequence of little-endian short integer binary values in the output string, yet feeding it the list that was literally $length 0; because braces suppress variable substitution, the literal string $length is passed along, and that is not an integer (integers don't start with $). We could instead have used [list $length 0] to produce the argument to s*, and that would have worked, but it doesn't seem quite right for the context of the question.
In binary format, these are the common formats (there are many more):
a is for string data (mnemonically “ASCII”); this is binary string data, and you need to encode it first.
i and I are for 32-bit numbers (mnemonically “int” like in many programming languages, but especially C). Upper case is big-endian, lower case is little-endian.
s and S are for 16-bit numbers (mnemonically “short”).
c is for 8-bit numbers (mnemonically “char” from C).
w and W are for 64-bit numbers (mnemonically “wide integers”).
f and d are for IEEE binary floating point numbers (mnemonically “float” and “double” respectively, so 4 and 8 bytes).
All can be followed by an optional length, either a number or a *. For the number ones, instead of inserting a single number they insert a list of them (and so consume a list); numbers give fixed lengths, and * does “all the list”. For the string format indicator, a number uses a fixed number of bytes in the message (truncating or padding with zero bytes as necessary) and * does “all the string” (never truncating or padding).
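For comparison only (not Tcl), the same wire layout the "Ia*" example produces, a big-endian 32-bit length followed by the encoded bytes, can be sketched in Go with the encoding/binary package; the message value here is just a placeholder:
package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

func main() {
    message := "hello"         // placeholder message
    encoded := []byte(message) // UTF-8 bytes, like [encoding convertto]
    var buf bytes.Buffer
    // Big-endian 32-bit length prefix, the equivalent of the "I" indicator.
    binary.Write(&buf, binary.BigEndian, uint32(len(encoded)))
    // Raw message bytes, the equivalent of "a*".
    buf.Write(encoded)
    fmt.Printf("% x\n", buf.Bytes()) // 00 00 00 05 68 65 6c 6c 6f
}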

Why can't a string be converted to slices of data types other than uint8 and int32?

When I try to convert a string to []int, compilation fails. I found that a string can be converted to []int32 ([]rune) and []uint8 ([]byte).
This is my test code:
s1 := "abcd"
b1 := []byte(s1)
r1 := []rune(s1)
i1 := []int8(s1) //error
The short answer is because the language specification does not allow it.
The allowed conversions for non-constant values are listed in Spec: Conversions:
A non-constant value x can be converted to type T in any of these cases:
x is assignable to T.
ignoring struct tags (see below), x's type and T have identical underlying types.
ignoring struct tags (see below), x's type and T are pointer types that are not defined types, and their pointer base types have identical underlying types.
x's type and T are both integer or floating point types.
x's type and T are both complex types.
x is an integer or a slice of bytes or runes and T is a string type.
x is a string and T is a slice of bytes or runes.
The longer answer is:
Spec: Conversions: Conversions to and from a string type:
Converting a signed or unsigned integer value to a string type yields a string containing the UTF-8 representation of the integer. Values outside the range of valid Unicode code points are converted to "\uFFFD".
Converting a slice of bytes to a string type yields a string whose successive bytes are the elements of the slice.
Converting a slice of runes to a string type yields a string that is the concatenation of the individual rune values converted to strings.
Converting a value of a string type to a slice of bytes type yields a slice whose successive elements are the bytes of the string.
Converting a value of a string type to a slice of runes type yields a slice containing the individual Unicode code points of the string.
Converting a string to []byte is "useful" because that is the UTF-8 encoded byte sequence of the text; this is exactly how Go stores strings in memory, and it is usually the data you should store or transmit in order to deliver a string over byte streams (such as an io.Writer), and similarly, this is what you can get out of an io.Reader.
Converting a string to []rune is also useful: it results in the characters (runes) of the text, so you can easily inspect / operate on the characters of a string (which is often needed in real-life applications).
Converting a string to []int8 is not nearly as useful, given that byte streams operate on bytes (byte being an alias for uint8, not int8). If in a specific case you need a []int8 from a string, you can write your own custom converter (which would most likely convert the individual bytes of the string to int8 values).
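Such a custom converter might look like this; a minimal sketch, and the function name stringToInt8s is just an example, not anything from the standard library:
package main

import "fmt"

// stringToInt8s converts each byte of s to an int8.
// Bytes >= 0x80 wrap around to negative values.
func stringToInt8s(s string) []int8 {
    out := make([]int8, len(s))
    for i := 0; i < len(s); i++ {
        out[i] = int8(s[i])
    }
    return out
}

func main() {
    fmt.Println(stringToInt8s("abcd")) // [97 98 99 100]
}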
from the go documentation:
string is the set of all strings of 8-bit bytes, conventionally but not
necessarily representing UTF-8-encoded text. A string may be empty, but
not nil. Values of string type are immutable.
so a string's underlying data is a sequence of 8-bit bytes (uint8 values),
which is why you can convert it to []uint8 ([]byte) and, because those bytes conventionally encode UTF-8 text, to []rune as well
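A short sketch of that dual view: indexing a string yields individual bytes (uint8), while converting to []rune (or ranging over the string) yields the decoded Unicode code points:
package main

import "fmt"

func main() {
    s := "héllo"
    fmt.Println(s[1], s[2]) // 195 169: the two UTF-8 bytes of 'é'
    fmt.Println([]byte(s))  // [104 195 169 108 108 111]
    fmt.Println([]rune(s))  // [104 233 108 108 111]
    for i, r := range s {   // ranging decodes runes, not bytes
        fmt.Printf("%d:%c ", i, r)
    }
    fmt.Println() // 0:h 1:é 3:l 4:l 5:o
}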

Convert large decimal number to hexadecimal notation

When creating a String object in Swift you can use a String Format Specifier to convert an integer to hexadecimal notation.
print(String(format:"%x", 1234))
// output: 4d2
// expected output: 4d2
But when numbers become bigger, the output is not as expected.
print(String(format:"%x", 12345678901234))
// output: 73ce2ff2
// expected output: b3a73ce2ff2
It seems that the output of String(format:"%x", n) is truncated at 8 characters. I don't think in hexadecimal natively, which makes debugging hard. I have seen answers for other programming languages where it is explained that you need to break up the large integer into parts, but that seems wrong to me.
What am I doing wrong here?
What is the right way to convert decimal numbers to hexadecimal numbers in Swift?
You need to use %lx or %llx
print(String(format:"%lx", 12345678901234))
b3a73ce2ff2
Table 2 on the site you linked specifies them
l -
Length modifier specifying that a following d, o, u, x, or X conversion specifier applies to a long or unsigned long argument.
x is for unsigned 32-bit integers, which only go up to 4,294,967,295.

Why does Char have an instance for Bounded?

Why is there a maxBound for Char? If Char is a character, why is it explained by a number, and if it is not a number, what does that number mean?
> maxBound :: Char
'\1114111'
All characters, like all things in a computer, are ultimately just numbers. Char represents unicode characters, which are represented via numbers. You can convert between Char and Int values with ord and chr. E.g. the unicode value for a is 97, so ord 'a' is 97 and chr 97 is 'a'.
Char '\1114111' is the Char that represents the number 1114111, or 0x10FFFF, which is defined as a noncharacter. This is the largest value that is defined in Unicode, and is the largest that Haskell supports: '\1114112' will cause a compile error.
Character encodings are tricky. Behind the scenes, all characters are represented by numbers. The Unicode standard provides a set of "code points" which are simply numbers which map to a particular sequence of real characters. Unicode defines code points between 0 and 1114111 and so that's what you see when you try maxBound.
Char encodes Unicode code points as individual integers, which is somewhat inefficient. If you want an efficient encoding, use Text.
You're seeing \1114111 displayed because that's the code point that maxBound :: Char represents and there is no more efficient, meaningful way to display it. In particular, it's in the "Supplementary Private Use Area-B" of the Unicode standard which means that it's reserved for use outside of the scope of Unicode and thus has no standard meaning.
The Char data type represents Unicode values. These values are stored in the computer as numbers, and each number as a specific representation on the screen. For Char, the minimum value is 0 and the maximum value is 1114111.
An easier example is C, in which the char type traditionally corresponds to the 7-bit ASCII table of characters, with values ranging from 0 to 127, although a char occupies a full 8-bit byte, so it can also store the values 0 through 255 (when unsigned).
Remember, everything is a number to a computer. Some data types have representations that can be ordered and are finite, so they have a minimum value and a maximum value.
An example of a data type in Haskell that does not have a minimum or maximum value is Integer, since it can represent any integer value so long as you have enough RAM available.
It's helpful to look at the source of the Bounded Char instance itself. Characters are effectively numbers with representation and the bounds represent the bounds of the Unicode code points.
instance Bounded Char where
    minBound = '\0'
    maxBound = '\x10FFFF'

How do int-to-string casts work in Go?

I only started Go today, so this may be obvious but I couldn't find anything on it.
What does var x uint64 = 0x12345678; y := string(x) give y?
I know var x uint8 = 65; y := string(x) would give y the byte 65, character A, and common sense would suggest (since types larger than uint8 are allowed to be converted to strings) that they would simply be packed into native byte order (i.e. little-endian) and assigned to the variable.
This does not seem to be the case:
hex.EncodeToString([]byte(y)) ==> "efbfbd"
First thought says this is an address with the last byte being left off because of some weird null terminator thingy, but if I allocate two x and y variables with two different values and print them out I get the same result.
var x, x2 uint64 = 0x10000000, 0x20000000
y, y2 := string(x), string(x2)
fmt.Println(hex.EncodeToString([]byte(y))) // "efbfbd"
fmt.Println(hex.EncodeToString([]byte(y2))) // "efbfbd"
Maddeningly I can't find the implementation for the string type anywhere although I probably haven't looked hard enough.
This is covered in the Spec: Conversions: Conversions to and from a string type:
Converting a signed or unsigned integer value to a string type yields a string containing the UTF-8 representation of the integer. Values outside the range of valid Unicode code points are converted to "\uFFFD".
So effectively when you convert a numeric value to string, it can only yield a string having one rune (character). And since Go stores strings as UTF-8 encoded byte sequences in memory, that is what you will see if you convert your string to []byte:
Converting a value of a string type to a slice of bytes type yields a slice whose successive elements are the bytes of the string.
When you try to convert the 0x12345678, 0x10000000 and 0x20000000 values to string, since they are outside the range of valid Unicode code points, as per the spec they are converted to "\uFFFD", which in UTF-8 encoding is []byte{239, 191, 189}; when encoded to a hex string:
fmt.Println(hex.EncodeToString([]byte("\uFFFD"))) // Output: efbfbd
Or simply:
fmt.Printf("%x", "\uFFFD") // Output: efbfbd
Read the blog post Strings, bytes, runes and characters in Go for more details about string internals.
And btw since Go 1.5 the Go runtime is implemented (mostly) in Go, so these conversions are now implemented in Go and can be found in the runtime package: runtime/string.go, look for the intstring() function.
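To see the described behaviour directly, here is a small sketch; the explicit conversion through rune is used because newer versions of go vet flag direct string(integer) conversions:
package main

import "fmt"

func main() {
    var a uint64 = 65         // a valid Unicode code point: 'A'
    var b uint64 = 0x12345678 // outside the valid code point range
    fmt.Printf("%q -> % x\n", string(rune(a)), []byte(string(rune(a)))) // "A" -> 41
    fmt.Printf("%q -> % x\n", string(rune(b)), []byte(string(rune(b)))) // "�" -> ef bf bd
}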
