Why is there a maxBound for Char? If Char is a character, why is it explained with numbers, and if it is not a number, what does it mean?
> maxBound :: Char
'\1114111'
All characters, like all things in a computer, are ultimately just numbers. Char represents Unicode characters, each of which is identified by a number (its code point). You can convert between Char and Int values with ord and chr: the Unicode value for 'a' is 97, so ord 'a' is 97 and chr 97 is 'a'.
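For example, in GHCi (a quick sketch; ord and chr come from Data.Char):
import Data.Char (ord, chr)
ord 'a'
97
chr 97
'a'
ord (maxBound :: Char)
1114111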
The Char '\1114111' represents the number 1114111, or 0x10FFFF, which Unicode defines as a noncharacter. This is the largest code point Unicode defines, and the largest that Haskell supports: '\1114112' causes a compile error.
Character encodings are tricky. Behind the scenes, all characters are represented by numbers. The Unicode standard provides a set of "code points", which are simply numbers that map to particular characters. Unicode defines code points between 0 and 1114111, and that is what you see when you try maxBound.
Char encodes Unicode code points as individual integers, which is somewhat inefficient. If you want an efficient encoding, use Text.
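For example (a minimal sketch; Data.Text comes from the text package, and pack converts a String to Text):
import qualified Data.Text as T
T.pack "hello"
"hello"
T.length (T.pack "hello")
5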
You're seeing '\1114111' displayed because that's the code point that maxBound :: Char represents, and there is no more meaningful way to display it. In particular, it lies in the "Supplementary Private Use Area-B" of the Unicode standard, which means it is reserved for use outside the scope of Unicode and thus has no standard meaning.
The Char data type represents Unicode values. These values are stored in the computer as numbers, and each number has a specific representation on the screen. For Char, the minimum value is 0 and the maximum value is 1114111.
An easier example is C, where the char type holds at least 8 bits. The classic ASCII table uses only the values 0 through 127 (7 bits), but a full 8-bit byte stored in an unsigned char gives you the values 0 through 255.
Remember, everything is a number to a computer. Some data types have representations that can be ordered and are finite, so they have a minimum value and a maximum value.
An example of a data type in Haskell that does not have a minimum or maximum value is Integer, since it can represent any integer value so long as you have enough RAM available.
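A quick GHCi check illustrates the difference (a sketch; the Int bounds shown are for a 64-bit GHC):
maxBound :: Int
9223372036854775807
minBound :: Int
-9223372036854775808
-- maxBound :: Integer is rejected: Integer has no Bounded instance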
It's helpful to look at the source of the Bounded Char instance itself. Characters are effectively numbers with a textual representation, and the bounds are the bounds of the Unicode code points:
instance Bounded Char where
minBound = '\0'
maxBound = '\x10FFFF'
Related
I am trying to create a binary message to send over a socket, but I'm having trouble with the way Tcl treats all variables as strings. I need to calculate the length of a string and include its value in the binary message.
set length [string length $message]
set binaryMessagePart [binary format s* { $length 0 }]
However, when I run this I get the error 'expected integer but got "$length"'. How do I get this to work and return the value for the integer 5 and not the char 5?
To calculate the length of a string, use string length. To calculate the length of a string in a particular encoding, convert the string to that encoding and use string length:
set enc "utf-8"; # Or whatever; you need to know this ahead of time for sanity's sake
set encoded [encoding convertto $enc $message]
set length [string length $encoded]
Note that the encoded length will be in bytes, whereas the length prior to encoding is in characters. For some messages and some encodings, the difference can be substantial.
To compose a binary message with the length and the body of the message (a fairly common binary format), use binary format like this:
# Assumes the length is big-endian; for little-endian, use i instead of I
set binPart [binary format "Ia*" $length $encoded]
What you were doing wrong was using s*, which consumes a list of integers and produces a sequence of little-endian 16-bit binary values in the output string, while feeding it the braced word { $length 0 }. Braces suppress substitution, so the list you passed literally contained the string $length, which is not an integer (integers don't start with $). We could instead have used [list $length 0] to produce the argument for s*, and that would have worked, but it doesn't seem quite right for the context of the question.
In binary format, these are the common formats (there are many more):
a is for string data (mnemonically “ASCII”); this is binary string data, and you need to encode it first.
i and I are for 32-bit numbers (mnemonically “int” like in many programming languages, but especially C). Upper case is big-endian, lower case is little-endian.
s and S are for 16-bit numbers (mnemonically “short”).
c is for 8-bit numbers (mnemonically “char” from C).
w and W are for 64-bit numbers (mnemonically “wide integers”).
f and d are for IEEE binary floating point numbers (mnemonically “float” and “double” respectively, so 4 and 8 bytes).
All of these can be followed by an optional count, either a number or *. For the numeric formats, a count makes them insert a list of numbers instead of a single one (and so they consume a list); a number gives a fixed count, and * means "the whole list". For the string format, a number uses a fixed number of bytes in the message (truncating or padding with zero bytes as necessary) and * means "the whole string" (never truncating or padding).
After several years of writing code for my own use, I'm trying to understand what this really means:
a = "Foo"
b = ""
c = 5
d = True
a - string variable. "Foo" (with quotes) - string literal, i.e. an entity of the string data type.
b - string variable. "" - empty string.
c - integer variable. 5 - integer literal, i.e. an entity of the integral data type.
d - Boolean variable. True - Boolean value, i.e. an entity of the Boolean data type.
Questions:
Is my understanding correct?
It seems that 5 is an integer literal, which is an entity of the integral data type. "Integer" and "integral": why do we use different words here?
What is the "string" and "integer"?
As I understand from Wikipedia, "string" and "integer" are not the same thing as string/integer literals or data types. In other words, there are 3 pairs of terms:
string literal, integer literal
string data type, integer data type
string, integer
Firstly, a literal value is any value which appears literally in code, e.g. "hello" is a string literal, 123 is an integer literal, etc. In contrast, for example:
int a = 5;
int b = 2;
int c = a + b;
a and b have literal values assigned to them, but c does not, it has a computed value assigned to it.
With any literal value, we describe it with its data type (as in the first sentence), e.g. "string literal" or "integer literal".
Now a data type refers to how the computer, or the software running on the computer, interprets the binary value of some data. For most kinds of data, the interpretation of the bytes is defined in a standard. UTF-8, for example, is one way to interpret the bytes of a string's internal (binary) value. Interestingly, the actual bytes of a string are treated as unsigned, 8-bit integers. In UTF-8, the values of those integers are combined in various ways to determine which glyph, or character, should appear on the screen when those values are encountered in the data. UTF-8 is a variable-length encoding which uses between 1 and 4 bytes (8 to 32 bits) per character.
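You can see the variable-length nature of UTF-8 directly in Haskell, for instance (a sketch; Data.Text and Data.ByteString come from the text and bytestring packages):
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.ByteString as BS
BS.unpack (T.encodeUtf8 (T.pack "A"))     -- [65]: one byte
BS.unpack (T.encodeUtf8 (T.pack "é"))     -- [195,169]: two bytes
BS.unpack (T.encodeUtf8 (T.pack "€"))     -- [226,130,172]: three bytes
BS.unpack (T.encodeUtf8 (T.pack "😀"))    -- [240,159,152,128]: four bytes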
For numbers, particularly integers, implementations can vary. A common convention (in network protocols, for instance) stores the most significant byte first; this is referred to as big-endian ordering of the bytes in a multi-byte integer. There is also little-endian ordering, used natively by most desktop CPUs, where the least significant byte comes first. For signed integers the most significant bit is the sign bit; for unsigned integers it is simply the most significant value bit. Integers can in principle use any number of bytes, but the typical sizes are 1, 2, 4 and sometimes 8 bytes, i.e. 8, 16, 32 or 64 bits; other sizes usually require a custom implementation.
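As a sketch of the byte-order idea in Haskell (Data.Bits and Data.Word are in base; the function name bigEndianBytes is made up for illustration):
import Data.Bits (shiftR)
import Data.Word (Word8, Word32)
-- Most significant byte first (big-endian); reverse the list for little-endian.
bigEndianBytes :: Word32 -> [Word8]
bigEndianBytes n = [ fromIntegral (n `shiftR` s) | s <- [24, 16, 8, 0] ]
-- bigEndianBytes 1          == [0,0,0,1]
-- bigEndianBytes 0x12345678 == [0x12,0x34,0x56,0x78]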
For floating-point numbers it gets a bit more tricky. There is a common standard for floating-point numbers, IEEE 754, which describes how floats are encoded as a sign bit, an exponent and a mantissa. Again there are different sizes and variations: 32-bit and 64-bit are the most common, 16-bit and 24-bit formats appear in some mobile and graphics hardware, and there are also extended-precision floats which use 40 or 80 bits.
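One way to see the IEEE 754 encoding is to reinterpret a Float's bits as an unsigned integer (a sketch; castFloatToWord32 lives in GHC.Float in recent versions of base):
import GHC.Float (castFloatToWord32)
import Numeric (showHex)
showHex (castFloatToWord32 1.0) ""
"3f800000"
showHex (castFloatToWord32 (-2.5)) ""
"c0200000"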
When creating a String object in Swift, you can use a String Format Specifier to convert an integer to hexadecimal notation.
print(String(format:"%x", 1234))
// output: 4d2
// expected output: 4d2
But when numbers become bigger, the output is not as expected.
print(String(format:"%x", 12345678901234))
// output: 73ce2ff2
// expected output: b3a73ce2ff2
It seems that the output of String(format:"%x", n) is truncated at 8 hex digits. I don't think in hexadecimal natively, which makes debugging hard. I have seen answers for other programming languages explaining that you need to break up the large integer into parts, but that seems wrong to me.
What am I doing wrong here?
What is the right way to convert decimal numbers to hexadecimal numbers in Swift?
You need to use %lx or %llx.
print(String(format:"%lx", 12345678901234))
b3a73ce2ff2
Table 2 on the site you linked specifies them:
l -
Length modifier specifying that a following d, o, u, x, or X conversion specifier applies to a long or unsigned long argument.
x is for unsigned 32-bit integers, which only go up to 4,294,967,295.
I want to convert a string like "\u****" to Text (Unicode) in Haskell.
I have a Java properties file, and it has the following content:
i18n.test.key=\u0050\u0069\u006e\u0067\u0020\uc190\uc2e4\ub960\u0020\ud50c\ub7ec\uadf8\uc778
I want to convert it to Text (Unicode) in Haskell.
I think I can do it like this:
Convert "\u****" to word8 array
Convert word8 array to ByteString
Use Text.Encoding.decodeUtf8 convert ByteString to text
But step 1 is a little complicated for me.
How to do it in Haskell?
A simple solution may look like this:
import qualified Data.ByteString as BS
import qualified Data.Text.Encoding as T
-- Decode a string of \uXXXX escapes as UTF-16BE text.
decodeJava = T.decodeUtf16BE . BS.concat . gobble
gobble [] = []
gobble ('\\':'u':a:b:c:d:rest) = let sym = convert16 [a,b] [c,d]
                                 in sym : gobble rest
gobble _ = error "decoding error"
-- Two hex-digit pairs become one two-byte (big-endian) ByteString.
convert16 hi lo = BS.pack [read $ "0x"++hi, read $ "0x"++lo]
Notes:
Your string is UTF-16 (big-endian) encoded, therefore you need decodeUtf16BE.
Decoding will fail if there are other characters in the string. This code will work with your example only if you remove the trailing i.
Constructing the words by appending 0x and, in particular, using read is very slow, but will do the trick for small data.
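With the definitions above loaded in GHCi, a quick check might look like this (a sketch using just the ASCII prefix of your example):
decodeJava "\\u0050\\u0069\\u006e\\u0067"
"Ping"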
If you replace \u with \x then this is a valid Haskell string literal.
my_string = "\x0050\x0069\x006e..."
You can then convert to Text if you want, or leave it as String, or whatever.
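For instance (a sketch; T.pack comes from the text package, and the my_text name is just for illustration):
import qualified Data.Text as T
my_string :: String
my_string = "\x0050\x0069\x006e\x0067"   -- the ASCII prefix, i.e. "Ping"
my_text = T.pack my_string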
Watch out, Java normally uses UTF-16 to encode its strings, so interpreting the bytes as UTF-8 will probably not work.
If the codes in your file are UTF-16, you need to do the following:
find the numeric value of each \uXXXX quadruple (a UTF-16 code unit)
check whether it is a high surrogate character. If so, the following value will be a low surrogate character, and the pair of surrogates maps to a single Unicode code point (see the sketch after this list).
make a String from your list of Unicode numbers with map toEnum (note that fromEnum goes the other way, from Char to Int)
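A minimal sketch of the surrogate-pair step in Haskell (the constants come from the UTF-16 definition; the function name fromSurrogatePair is made up for illustration):
import Data.Char (chr)
-- Combine a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF)
-- into the single code point they encode.
fromSurrogatePair :: Int -> Int -> Char
fromSurrogatePair hi lo = chr (0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00))
-- fromSurrogatePair 0xD83D 0xDE00 == '\128512'   (U+1F600)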
The following is a quote from the Java doc http://docs.oracle.com/javase/7/docs/api/ :
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Java has methods to combine a high surrogate character and a low surrogate character to get the Unicode point. You may want to check the source of the java.lang.Character class to find out how exactly they do this, but I guess it is some simple bit-operation.
Another possibility would be to check for a Haskell library that does UTF-16 decoding.
Just out of curiosity: why, in Delphi, if we define an empty char with:
a:Char;
a:='';
do we get the error Incompatible types: 'Char' and 'string',
whereas if we write
a:='a';
it is fine?
Is it necessary to define an empty char with a := #0?
A char is a single (that is, exactly one) character. So 'a', '∫', and '⌬' are all OK, but not 'ab' (a two-character string), 'Hello World!' (a twelve-character string), or '' (a zero-character string).
However, the NUL character (#0) is a character like any other.
In addition, the character datatype is implemented as a word (in modern versions of Delphi), that is, as two bytes. If all these values 0, 1, ..., 2^16 - 1 are used for real characters, how in the world would you represent your 'empty char'?
There is no such thing as an empty char. A char has to have a value. It is an ordinal type, a simple value type. Just as an integer, say, always has a value, so does a char.
The value #0 is not an empty char, it is the character with value 0, commonly known as NUL.