Binary Formatting Variables in TCL - string

I am trying to create a binary message to send over a socket, but I'm having trouble with the way TCL treats all variables as strings. I need to calculate the length of a string and know its value in binary.
set length [string length $message]
set binaryMessagePart [binary format s* { $length 0 }]
However, when I run this I get the error 'expected integer but got "$length"'. How do I get this to work and return the value for the integer 5 and not the char 5?

To calculate the length of a string, use string length. To calculate the length of a string in a particular encoding, convert the string to that encoding and use string length:
set enc "utf-8"; # Or whatever; you need to know this ahead of time for sanity's sake
set encoded [encoding convertto $enc $message]
set length [string length $encoded]
Note that with the encoded length, this will be in bytes whereas the length prior to encoding is in characters. For some messages and some encodings, the difference can be substantial.
To compose a binary message with the length and the body of the message (a fairly common binary format), use binary format like this:
# Assumes the length is big-endian; for little-endian, use i instead of I
set binPart [binary format "Ia*" $length $encoded]
What you were doing wrong was using s* which consumes a list of integers and produces a sequence of little-endian short integer binary values in the output string, and yet were feeding the list that was literally $length 0; and the string $length is not an integer as those don't start with $. We could have instead done [list $length 0] to produce the argument to s* and that would have worked, but that doesn't seem quite right for the context of the question.
In binary format, these are the common formats (there are many more):
a is for string data (mnemonically “ASCII”); this is binary string data, and you need to encode it first.
i and I are for 32-bit numbers (mnemonically “int” like in many programming languages, but especially C). Upper case is big-endian, lower case is little-endian.
s and S are for 16-bit numbers (mnemonically “short”).
c is for 8-bit numbers (mnemonically “char” from C).
w and W are for 64-bit numbers (mnemonically “wide integers”).
f and d are for IEEE binary floating point numbers (mnemonically “float” and “double” respectively, so 4 and 8 bytes).
All can be followed by an optional length, either a number or a *. For the number ones, instead of inserting a single number they insert a list of them (and so consume a list); numbers give fixed lengths, and * does “all the list”. For the string format indicator, a number uses a fixed number of bytes in the message (truncating or padding with zero bytes as necessary) and * does “all the string” (never truncating or padding).

Related

Strings, integers, data types

After several years of writing code for my own use, I'm trying to understand what does it really mean.
a = "Foo"
b = ""
c = 5
d = True
a - string variable. "Foo" (with quotes) - string literal, i.e. an entity of the string data type.
b - string variable. "" - empty string.
c - integer variable. 5 - integer literal, i.e. an entity of the integral data type.
d - Boolean variable. True - Boolean value, i.e. an entity of the Boolean data type.
Questions:
Is my understanding is correct?
It seems that 5 is an integer literal, which is an entity of the integral data type. "Integer" and "integral": for what reason we use different words here?
What is the "string" and "integer"?
As I understand from Wikipedia, "string" and "integer" are not the same thing as string/integer literals or data types. In other words, there are 3 pairs or terms:
string literal, integer literal
string data type, integer data type
string, integer
Firstly, a literal value is any value which appears literally in code, e.g "hello" is a string literal, 123 is an integer literal, etc. In contrast for example:
int a = 5;
int b = 2;
int c = a + b;
a and b have literal values assigned to them, but c does not, it has a computed value assigned to it.
With any literal value we describe the literal value with it's data type ( as in the first sentence ), e.g. "string literal" or "integer literal".
Now a data type refers to how the computer, or the software running on the computer, interprets the binary value of some data. For most kinds of data, the interpretation of the bytes is typically defined in a standard. utf-8 for example is one way to interpret the bytes of a string's internal (binary) value. Interestingly, the actual bytes of a string are treated as unsigned, 8-bit integers. In utf-8, the values of those integers are combined in various ways to determine which glyph, or character, should appear on the screen when those values are encountered in the data. utf-8 is a variable-byte-length encoding which can have between 1 and 4 bytes per character ( 8 to 32-bit ).
For numbers, particularly integers, implementations can vary, but most representations use four bytes with the most significant byte first in order, and the first bit of the first byte as the sign, with signed integers, or the first bit is simply the most significant bit for unsigned integers. This is referred to as big-endian ordering of bytes in a multi-byte integer. There is also little-endian encoding, and integers can in principle use any number of bytes, but the most typically implemented are 1, 2, 4 and sometimes 8, where bit-wise you have 8, 16, 32 or 64, respectively. For integer sizes that are not of these sizes, typically requires a custom implementation.
For floating point numbers it gets a bit more tricky. There is a common standard for floating point numbers called IEEE-754 which describes how floats are encoded. Likewise for floats, there are different sizes and variations, but primarily we use 16, 32, 64 and sometimes 24-bit in some mobile device graphics implementations. There are also extended precision floats which use 40 or 80 bits.

What determines the position of a character when looping through UTF-8 strings?

I am reading the section on for statements in the Effective Go documentation and came across this example:
for pos, char := range "日本\x80語" {
fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
The output is:
Character U+65E5 '日', at position: 0
Character U+672C '本', at position: 3
Character U+FFFD '�', at position: 6
Character U+8A9E '語', at position: 7
What I don't understand is why the positions are 0, 3, 6, and 7. This tells me the first and second character is 3 bytes long and the 'replacement rune' (U+FFFD) is 1 byte long, which I accept and understand. However, I thought rune was of int32 type and therefore would be 4 bytes each, not three.
Why are the positions in a range different to the total amount of memory each value should be consuming?
string values in Go are stored as read only byte slices ([]byte), where the bytes are the UTF-8 encoded bytes of the (runes of the) string. UTF-8 is a variable-length encoding, different Unicode code points may be encoded using different number of bytes. For example values in the range 0..127 are encoded as a single byte (whose value is the unicode codepoint itself), but values greater than 127 use more than 1 byte. The unicode/utf8 package contains UTF-8 related utility functions and constants, for example utf8.UTFMax reports the maximum number of bytes a valid Unicode codepoint may "occupy" in UTF-8 encoding (which is 4).
One thing to note here: not all possible byte sequences are valid UTF-8 sequences. A string may be any byte sequence, even those that are invalid UTF-8 sequences. For example the string value "\xff" represents an invalid UTF-8 byte sequence, for details, see How do I represent an Optional String in Go?
The for range construct –when applied on a string value– iterates over the runes of the string:
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
The for range construct may produce 1 or 2 iteration values. When using 2, like in your example:
for pos, char := range "日本\x80語" {
fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
For each iteration, pos will be byte index of the rune / character, and char will be the rune of the string. As you can see in the quote above, if the string is an invalid UTF-8 byte sequence, when an invalid UTF-8 sequence is encountered, char will be 0xFFFD (the Unicode replacement character), and the for range construct (the iteration) will advance a singe byte only.
To sum it up: The position is always the byte index of the rune of the current iteration (or more specifically: the byte index of the first byte of the UTF-8 encoded sequence of the rune of the current iteration), but if invalid UTF-8 sequence is encountered, the position (index) will only be incremented by 1 in the next iteration.
A must-read blog post if you want to know more about the topic:
The Go Blog: Strings, bytes, runes and characters in Go
rune is code point. Code point is just integer. You can even use int64 to store it if you want to. (But Unicode only has 1,114,112 code points so int32 should be the right choice. No wonder rune is alias of int32 in Golang.)
Different encoding schemes encode code points in different ways. E.g. CJK character is usually encoded to 3 bytes in UTF-8, and to 2 bytes in UTF-16.
String literal in Golang is UTF-8.

VBA Byte Array to String

Apologies if this question has been previously answered, I was unable to find an explanation. I've created a script in VBScript to encrypt an user input and match to an already encrypted password. I ran into some issues along the way and managed to deduce to the following.
I have a byte array (1 to 2) as values (1, 16). I am then defining a string with the value of the array as per below:
Dim bytArr(1 To 2) As Byte
Dim output As String
bytArr(1) = 16
bytArr(2) = 1
output = bytArr
Debug.Print output
The output I get is Ð (Eth) ASCII Value 208. Could someone please explain how the byte array is converted to this character?
In VBA, Byte Arrays are special because, unlike arrays of other datatypes, a string can be directly assigned to a byte array. In VBA, Strings are UNICODE strings, so when one assigns a string to a byte array then it stores two digits for each character;
although the glyphs seem to be the same, see charmap:
Ð is Unicode Character 'LATIN CAPITAL LETTER ETH' (U+00D0) shown in charmap DOS Western (Central) Europe character set (0xD1, i.e. decimal 209);
Đ is Unicode Character 'LATIN CAPITAL LETTER D WITH STROKE' (U+0110) shown in charmap Windows Western (Central Europe) character set (0xD0, i.e. decimal 208).
Get above statements together keeping in mind endianness (byte order) of the computer architecture: Intel x86 processors use little-endian, so byte array (0x10, 0x01) is the same as unicode string U+0110.
Charaters are amalgamated via flagrant mojibake case. For proof, please use Asc and AscW Functions as follows: Debug.Print output, Asc(output), AscW(output) with different console code pages, e.g. under chcp 852 and chcp 1250.

Perl's default string encoding and representation

In the following:
my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";
The x{FB01} and x{E9} are code points. And code points are encoded via an encoding scheme to a series of octets.
So the character è which has the codepoint \x{FB01} is part of the string of $string. But how does this work? Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
If yes why do I get the following behavior?
my $str = "Some arbitrary string\n";
if(Encode::is_utf8($str)) {
print "YES str IS UTF8!\n";
}
else {
print "NO str IT IS NOT UTF8\n";
}
This prints "NO str IT IS NOT UTF8\n"
Additionally Encode::is_utf8($string) returns true.
In what way are $string and $str different and one is considered UTF-8 and the other not?
And in any case what is the encoding of $str? ASCII? Is this the default for Perl?
In C, a string is a collection of octets, but Perl has two string storage formats:
String of 8-bit values.
String of 72-bit values. (In practice, limited to 32-bit or 64-bit.)
As such, you don't need to encode code points to store them in a string.
my $s = "\x{2660}\x{2661}";
say length $s; # 2
say sprintf '%X', ord substr($s, 0, 1); # 2660
say sprintf '%X', ord substr($s, 1, 1); # 2661
(Internally, an extension of UTF-8 called "utf8" is used to store the strings of 72-bit chars. That's not something you should ever have to know except to realize the performance implications, but there are bugs that expose this fact.)
Encode's is_utf8 reports which type of string a scalar contains. It's a function that serves absolutely no use except to debug the bugs I previously mentioned.
An 8-bit string can store the value of "abc" (or the string in the OP's $str), so Perl used the more efficient 8-bit (UTF8=0) string format.
An 8-bit string can't store the value of "\x{2660}\x{2661}" (or the string in the OP's $string), so Perl used the 72-bit (UTF8=1) string format.
Zero is zero whether it's stored in a floating point number, a signed integer or an unsigned integer. Similarly, the storage format of strings conveys no information about the value of the string.
You can store code points in an 8-bit string (if they're small enough) just as easily as a 72-bit string.
You can store bytes in a 72-bit string just as easily as an 8-bit string.
In fact, Perl will switch between the two formats at will. For example, if you concatenate $string with $str, you'll get a string in the 72-bit format.
You can alter the storage format of a string with the builtins utf8::downgrade and utf8::upgrade, should you ever need to work around a bug.
utf8::downgrade($s); # Switch to strings of 8-bit values (UTF8=0).
utf8::upgrade($s); # Switch to strings of 72-bit values (UTF8=1).
You can see the effect using Devel::Peek.
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::downgrade($s); Dump($s);"
SV = PV(0x7b8a74) at 0x4a84c4
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7bab9c "\200"\0
CUR = 1
LEN = 12
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::upgrade($s); Dump($s);"
SV = PV(0x558a6c) at 0x1cc843c
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x55ab94 "\302\200"\0 [UTF8 "\x{80}"]
CUR = 2
LEN = 12
The \x{FB01} and \x{E9} are code points.
Not quiet, the numeric values inside the braces are codepoints. The whole \x expression is just a notation for a character. There are several notations for characters, most of them starting with a backslash, but the common one is the simple string literal. You might as well write:
use utf8;
my $string = "Can you find my résumé?\n";
# ↑ ↑ ↑
And code points are encoded via an encoding scheme to a series of octets.
True, but so far your string is a string of characters, not a buffer of octets.
But how does this work?
Strings consist of characters. That's just Perl's model. You as a programmer are supposed to deal with it at this level.
Of course, the computer can't, and the internal data structure must have some form of internal encoding. Far too much confusion ensues because "Perl can't keep a secret", the details leak out occasionally.
Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
No, the internal encoding is lax UTF8 (no dash). It does not have some of the restrictions that UTF-8 (a.k.a. UTF-8-strict) has.
UTF-8 goes up to 0x10_ffff, UTF8 goes up to 0xffff_ffff_ffff_ffff on my 64-bit system. Codepoints greater than 0xffff_ffff will emit a non-portability warning, though.
In UTF-8 certain codepoints are non-characters or illegal characters. In UTF8, anything goes.
Encode::is_utf8
… is an internals function, and is clearly marked as such. You as a programmer are not supposed to peek. But since you want to peek, no one can stop you. Devel::Peek::Dump is a better tool for getting at the internals.
Read http://p3rl.org/UNI for an introduction to the topic of encoding in Perl.
is_utf8 is a badly-named function that doesn't mean what you think it means or have anything to do with that. The answer to your question is that $string doesn't have an encoding, because it's not encoded. When you call Encode::encode with some encoding, the result of that will be a string that is encoded, and has a known encoding

How do int-to-string casts work in Go?

I only started Go today, so this may be obvious but I couldn't find anything on it.
What does var x uint64 = 0x12345678; y := string(x) give y?
I know var x uint8 = 65; y := string(x) would give y the byte 65, character A, and common sense would suggest (since types larger than uint8 are allowed to be cast to strings) that they would simply be packed in to native byte order (i.e little endian) and assigned to the variable.
This does not seem to be the case:
hex.EncodeToString([]byte(y)) ==> "efbfbd"
First thought says this is an address with the last byte being left off because of some weird null terminator thingy, but if I allocate two x and y variables with two different values and print them out I get the same result.
var x, x2 uint64 = 0x10000000, 0x20000000
y, y2 := string(x), string(x2)
fmt.Println(hex.EncodeToString([]byte(y))) // "efbfbd"
fmt.Println(hex.EncodeToString([]byte(y2))) // "efbfbd"
Maddeningly I can't find the implementation for the string type anywhere although I probably haven't looked hard enough.
This is covered in the Spec: Conversions: Conversions to and from a string type:
Converting a signed or unsigned integer value to a string type yields a string containing the UTF-8 representation of the integer. Values outside the range of valid Unicode code points are converted to "\uFFFD".
So effectively when you convert a numeric value to string, it can only yield a string having one rune (character). And since Go stores strings as the UTF-8 encoded byte sequences in memory, that is what you will see if you convert your string to []byte:
Converting a value of a string type to a slice of bytes type yields a slice whose successive elements are the bytes of the string.
When you try to conver the 0x12345678, 0x10000000 and 0x20000000 values to string, since they are outside of the range of valid Unicode code points, as per spec they are converted to "\uFFFD" which in UTF-8 encoding is []byte{239, 191, 189}; when encoded to hex string:
fmt.Println(hex.EncodeToString([]byte("\uFFFD"))) // Output: efbfbd
Or simply:
fmt.Printf("%x", "\uFFFD") // Output: efbfbd
Read the blog post Strings, bytes, runes and characters in Go for more details about string internals.
And btw since Go 1.5 the Go runtime is implemented (mostly) in Go, so these conversions are now implemented in Go and can be found in the runtime package: runtime/string.go, look for the intstring() function.

Resources