Haskell Char and unicode surrogates - haskell

I was playing around with strings and discovered that Haskell (correctly) disallows characters above Unicode code point 0x10ffff (ie one gets something like a sequence out of range error if one attempts to use something above this limit). Out of curiosity, i played around with the Unicode surrogate halves (0xd800 to 0xdfff) - invalid Unicode codepoints, and discovered that they seem to be permitted. I am curious as to why this is. Is it simply because being a bounded item means only defining a maximum and a minimum?

Disallowing the surrogate code units would indeed make Char a more correct type for Unicode code points. The Report says that Char is "an enumeration whose values represent Unicode characters", so probably this should be considered a GHC bug.
There's no specific notion of "a bounded item", but it would require extra checks in various places (right now chr just needs to make one comparison to check if its argument is valid, for instance) and possibly make some things behave more strangely (if people indirectly expect code points to be contiguous).
I don't know that there's an especially good rationale for it, though, or that the trade-off was even considered originally. In Haskell 1.4, Char was just a 16-bit type, so it would have been natural to extend it to 17*2^16 values without adding extra checks. This issue is occasionally brought up -- I've brought it up before -- but most people don't seem to worry about it very much. It's probably reasonable to file a GHC bug about it, though, to get a proper discussion going.
Note that Data.Text (which uses UTF-16 as its internal representations) does disallow the invalid code units (it has to).

Related

`Data.Text` vs `Data.Vector.Unboxed Char`

Is there any difference in how Data.Text and Data.Vector.Unboxed Char work internally? Why would I choose one over the other?
I always thought it was cool that Haskell defines String as [Char]. Is there a reason that something analagous wasn't done for Text and Vector Char?
There certainly would be an advantage to making them the same.... Text-y and Vector-y tools could be written to be used in both camps. Imagine Ropes of Ints, or Regexes on strings of poker cards.
Of course, I understand that there were probably historical reasons and I understand that most current libraries use Data.Text, not Vector Char, so there are many practical reasons to favor one over the other. But I am more interested in learning about the abstract qualities, not the current state that we happen to be in.... If the whole thing were rewritten tomorrow, would it be better to unify the two?
Edit, with more info-
To put stuff into perspective-
According to this page, http://www.haskell.org/haskellwiki/GHC/Memory_Footprint, GHC uses 16 bytes for each Char in your program!
Data.Text is not O(1) index'able, it is O(n).
Ropes (binary trees wrapped around text) can also hold strings.... They have better complexity for index/insert/delete, although depending on the number of nodes and balance of the tree, index could be close to that of Text.
This is my takeaway from this-
Text and Vector Char are different internally....
Use String if you don't care about performance.
If performance is important, default to using Text.
If fast indexing of chars is necessary, and you don't mind a lot of memory overhead (up to 16x), use Vector Char.
If you want to insert/delete a lot of data, use Ropes.
It's a fairly bad idea to think of Text as being a list of characters. Text is designed to be thought of as an opaque, user-readable blob of Unicode text. Character boundaries might be defined based on encoding, locale, language, time of month, phase of the moon, coin flips performed by a blinded participant, and migratory patterns of Venezuela's national bird whatever it may be. The same story happens with sorting, up-casing, reversing, etc.
Which is a long way of saying that Text is an abstract type representing human language and goes far out of its way to not behave just the same way as its implementation, be it a ByteString, a Vector UTF16CodePoint, or something totally unique (which is the case).
To clarify this distinction take note that there's no guarantee that unpack . pack witnesses an isomorphism, that the preferred ways of converting from Text to ByteString are in Data.Text.Encoding and are partial, and that there's a whole sophisticated plug-in module text-icu littered with complex ways of handling human language strings.
You absolutely should use Text if you're dealing with a human language string. You should also be really careful to treat it with care since human language strings are not easily amenable to computer processing. If your string is better thought of as a machine string, you probably should use ByteString.
The pedagogical advantages of type String = [Char] are high, but the practical advantages are quite low.
To add to what J. Abrahamson said, it's also worth making the distinction between iterating over runes (roughly character by character, but really could be ideograms too) as opposed to unitary logical unicode code points. Sometimes you need to know if you're looking at a code point that has been "decorated" by a previous code point.
In the case of the latter, you then have to make the distinction between code points that stand alone (such as letters, ideograms) and those that modify the text that follows (right-to-left code point, diacritics, etc).
Well implemented unicode libraries will typically abstract these details away and let you process the text in a more or less character-by-character fashion but you have to drop certain assumptions that come from thinking in terms of ASCII.
A byte is not a character. A logical unit of text isn't necessarily a "character". Not every code point stands alone, some decorate/annotate the following code point or even the rest of the byte stream until invalidated (right-to-left).
Unicode is hard. There is no one true encoding that will eliminate the difficulty of encapsulating the variety inherent in human language. Data.Text does a respectable job of it though.
To summarize:
The methods of processing are:
byte-by-byte - totally invalid for unicode, only applicable to latin-1/ASCII
code point by code point - works for processing unicode, but is lower-level than people realize
logical rune-by-rune - what you actually want
The types are:
String (aka [Char]) - has a limited scope. Best used for teaching Haskell or for legacy use-cases.
Text - the preferred way to handle "human" text.
Bytestring - for byte streams, raw data, binary etc.

Data.Text vs String

While the general opinion of the Haskell community seems to be that it's always better to use Text instead of String, the fact that still the APIs of most of maintained libraries are String-oriented confuses the hell out of me. On the other hand, there are notable projects, which consider String as a mistake altogether and provide a Prelude with all instances of String-oriented functions replaced with their Text-counterparts.
So are there any reasons for people to keep writing String-oriented APIs except backwards- and standard Prelude-compatibility and the "switch-making inertia"?
Are there possibly any other drawbacks to Text as compared to String?
Particularly, I'm interested in this because I'm designing a library and trying to decide which type to use to express error messages.
My unqualified guess is that most library writers don't want to add more dependencies than necessary. Since strings are part of literally every Haskell distribution (it's part of the language standard!), it is a lot easier to get adopted if you use strings and don't require your users to sort out Text distributions from hackage.
It's one of those "design mistakes" that you just have to live with unless you can convince most of the community to switch over night. Just look at how long it has taken to get Applicative to be a superclass of Monad – a relatively minor but much wanted change – and imagine how long it would take to replace all the String things with Text.
To answer your more specific question: I would go with String unless you get noticeable performance benefits by using Text. Error messages are usually rather small one-off things so it shouldn't be a big problem to use String.
On the other hand, if you are the kind of ideological purist that eschews pragmatism for idealism, go with Text.
* I put design mistakes in scare quotes because strings as a list-of-chars is a neat property that makes them easy to reason about and integrate with other existing list-operating functions.
If your API is targeted at processing large amounts of character oriented data and/or various encodings, then your API should use Text.
If your API is primarily for dealing with small one-off strings, then using the built-in String type should be fine.
Using String for large amounts of text will make applications using your API consume significantly more memory. Using it with foreign encodings could seriously complicate usage depending on how your API works.
String is quite expensive (at least 5N words where N is the number of Char in the String). A word is same number of bits as the processor architecture (ex. 32 bits or 64 bits):
http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
There are at least three reasons to use [Char] in small projects.
[Char] does not rely on any arcane staff, like foreign pointers, raw memory, raw arrays, etc that may work differently on different platforms or even be unavailable altogether
[Char] is the lingua franka in haskell. There are at least three 'efficient' ways to handle unicode data in haskell: utf8-bytestring, Data.Text.Text and Data.Vector.Unboxed.Vector Char, each requiring dealing with extra package.
by using [Char] one gains access to all power of [] monad, including many specific functions (alternative string packages do try to help with it, but still)
Personally, I consider utf16-based Data.Text one of the most questionable desicions of the haskell community, since utf16 combines flaws of both utf8 and utf32 encoding while having none of their benefits.
I wonder if Data.Text is always more efficient than Data.String???
"cons" for instance is O(1) for Strings and O(n) for Text. Append is O(n) for Strings and O(n+m) for strict Text's. Likewise,
let foo = "foo" ++ bigchunk
bar = "bar" ++ bigchunk
is more space efficient for Strings than for strict Texts.
Other issue not related to efficiency is pattern matching (perspicuous code) and lazyness (predictably per-character in Strings, somehow implementation dependent in lazy Text).
Text's are obviously good for static character sequences and for in-place modification. For other forms of structural editing, Data.String might have advantages.
I do not think there is a single technical reason for String to remain.
And I can see several ones for it to go.
Overall I would first argue that in the Text/String case there is only one best solution :
String performances are bad, everyone agrees on that
Text is not difficult to use. All functions commonly used on String are available on Text, plus some useful more in the context of strings (substitution, padding, encoding)
having two solutions creates unnecessary complexity unless all base functions are made polymorphic. Proof : there are SO questions on the subject of automatic conversions. So this is a problem.
So one solution is less complex than two, and the shortcomings of String will make it disappear eventually. The sooner the better !

When to use Unicode Normalization Forms NFC and NFD?

The Unicode Normalization FAQ includes the following paragraph:
Programs should always compare canonical-equivalent Unicode strings as equal ... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.
and continues...
The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.
My questions are:
What makes NFC best for "general text." What defines "internal processing" and why is it best left to NFD? And finally, never minding what is "best," are the two forms interchangable as long as two strings are compared using the same normalization form?
The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.
In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.
For example, U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.
There are risks that you take by not normalizing. For example, the letter “ä” can appear as a single Unicode character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or as two Unicode characters U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter and your code tests for data containing “ä”, using the precomposed form only, then it will not detect the latter. But in many cases, you don’t do such things but simply store the data, concatenate strings, print them, etc. Then there is a risk that the two representations result in somewhat different renderings.
It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.
NFC is the general common sense form that you should use, ä is 1 code point there and that makes sense.
NFD is good for certain internal processing - if you want to make accent-insensitive searches or sorting, having your string in NFD makes it much easier and faster. Another usage is making more robust slug titles. These are just the most obvious ones, I am sure there are plenty of more uses.
If two strings x and y are canonical equivalents, then
toNFC(x) = toNFC(y)
toNFD(x) = toNFD(y)
Is that what you meant?

Why do a lot of programming languages put the type *after* the variable name?

I just came across this question in the Go FAQ, and it reminded me of something that's been bugging me for a while. Unfortunately, I don't really see what the answer is getting at.
It seems like almost every non C-like language puts the type after the variable name, like so:
var : int
Just out of sheer curiosity, why is this? Are there advantages to choosing one or the other?
There is a parsing issue, as Keith Randall says, but it isn't what he describes. The "not knowing whether it is a declaration or an expression" simply doesn't matter - you don't care whether it's an expression or a declaration until you've parsed the whole thing anyway, at which point the ambiguity is resolved.
Using a context-free parser, it doesn't matter in the slightest whether the type comes before or after the variable name. What matters is that you don't need to look up user-defined type names to understand the type specification - you don't need to have understood everything that came before in order to understand the current token.
Pascal syntax is context-free - if not completely, at least WRT this issue. The fact that the variable name comes first is less important than details such as the colon separator and the syntax of type descriptions.
C syntax is context-sensitive. In order for the parser to determine where a type description ends and which token is the variable name, it needs to have already interpreted everything that came before so that it can determine whether a given identifier token is the variable name or just another token contributing to the type description.
Because C syntax is context-sensitive, it very difficult (if not impossible) to parse using traditional parser-generator tools such as yacc/bison, whereas Pascal syntax is easy to parse using the same tools. That said, there are parser generators now that can cope with C and even C++ syntax. Although it's not properly documented or in a 1.? release etc, my personal favorite is Kelbt, which uses backtracking LR and supports semantic "undo" - basically undoing additions to the symbol table when speculative parses turn out to be wrong.
In practice, C and C++ parsers are usually hand-written, mixing recursive descent and precedence parsing. I assume the same applies to Java and C#.
Incidentally, similar issues with context sensitivity in C++ parsing have created a lot of nasties. The "Alternative Function Syntax" for C++0x is working around a similar issue by moving a type specification to the end and placing it after a separator - very much like the Pascal colon for function return types. It doesn't get rid of the context sensitivity, but adopting that Pascal-like convention does make it a bit more manageable.
the 'most other' languages you speak of are those that are more declarative. They aim to allow you to program more along the lines you think in (assuming you aren't boxed into imperative thinking).
type last reads as 'create a variable called NAME of type TYPE'
this is the opposite of course to saying 'create a TYPE called NAME', but when you think about it, what the value is for is more important than the type, the type is merely a programmatic constraint on the data
If the name of the variable starts at column 0, it's easier to find the name of the variable.
Compare
QHash<QString, QPair<int, QString> > hash;
and
hash : QHash<QString, QPair<int, QString> >;
Now imagine how much more readable your typical C++ header could be.
In formal language theory and type theory, it's almost always written as var: type. For instance, in the typed lambda calculus you'll see proofs containing statements such as:
x : A y : B
-------------
\x.y : A->B
I don't think it really matters, but I think there are two justifications: one is that "x : A" is read "x is of type A", the other is that a type is like a set (e.g. int is the set of integers), and the notation is related to "x ε A".
Some of this stuff pre-dates the modern languages you're thinking of.
An increasing trend is to not state the type at all, or to optionally state the type. This could be a dynamically typed langauge where there really is no type on the variable, or it could be a statically typed language which infers the type from the context.
If the type is sometimes given and sometimes inferred, then it's easier to read if the optional bit comes afterwards.
There are also trends related to whether a language regards itself as coming from the C school or the functional school or whatever, but these are a waste of time. The languages which improve on their predecessors and are worth learning are the ones that are willing to accept input from all different schools based on merit, not be picky about a feature's heritage.
"Those who cannot remember the past are condemned to repeat it."
Putting the type before the variable started innocuously enough with Fortran and Algol, but it got really ugly in C, where some type modifiers are applied before the variable, others after. That's why in C you have such beauties as
int (*p)[10];
or
void (*signal(int x, void (*f)(int)))(int)
together with a utility (cdecl) whose purpose is to decrypt such gibberish.
In Pascal, the type comes after the variable, so the first examples becomes
p: pointer to array[10] of int
Contrast with
q: array[10] of pointer to int
which, in C, is
int *q[10]
In C, you need parentheses to distinguish this from int (*p)[10]. Parentheses are not required in Pascal, where only the order matters.
The signal function would be
signal: function(x: int, f: function(int) to void) to (function(int) to void)
Still a mouthful, but at least within the realm of human comprehension.
In fairness, the problem isn't that C put the types before the name, but that it perversely insists on putting bits and pieces before, and others after, the name.
But if you try to put everything before the name, the order is still unintuitive:
int [10] a // an int, ahem, ten of them, called a
int [10]* a // an int, no wait, ten, actually a pointer thereto, called a
So, the answer is: A sensibly designed programming language puts the variables before the types because the result is more readable for humans.
I'm not sure, but I think it's got to do with the "name vs. noun" concept.
Essentially, if you put the type first (such as "int varname"), you're declaring an "integer named 'varname'"; that is, you're giving an instance of a type a name. However, if you put the name first, and then the type (such as "varname : int"), you're saying "this is 'varname'; it's an integer". In the first case, you're giving an instance of something a name; in the second, you're defining a noun and stating that it's an instance of something.
It's a bit like if you were defining a table as a piece of furniture; saying "this is furniture and I call it 'table'" (type first) is different from saying "a table is a kind of furniture" (type last).
It's just how the language was designed. Visual Basic has always been this way.
Most (if not all) curly brace languages put the type first. This is more intuitive to me, as the same position also specifies the return type of a method. So the inputs go into the parenthesis, and the output goes out the back of the method name.
I always thought the way C does it was slightly peculiar: instead of constructing types, the user has to declare them implicitly. It's not just before/after the variable name; in general, you may need to embed the variable name among the type attributes (or, in some usage, to embed an empty space where the name would be if you were actually declaring one).
As a weak form of pattern-matching, it is intelligable to some extent, but it doesn't seem to provide any particular advantages, either. And, trying to write (or read) a function pointer type can easily take you beyond the point of ready intelligability. So overall this aspect of C is a disadvantage, and I'm happy to see that Go has left it behind.
Putting the type first helps in parsing. For instance, in C, if you declared variables like
x int;
When you parse just the x, then you don't know whether x is a declaration or an expression. In contrast, with
int x;
When you parse the int, you know you're in a declaration (types always start a declaration of some sort).
Given progress in parsing languages, this slight help isn't terribly useful nowadays.
Fortran puts the type first:
REAL*4 I,J,K
INTEGER*4 A,B,C
And yes, there's a (very feeble) joke there for those familiar with Fortran.
There is room to argue that this is easier than C, which puts the type information around the name when the type is complex enough (pointers to functions, for example).
What about dynamically (cheers #wcoenen) typed languages? You just use the variable.

Efficient String Implementation in Haskell

I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.
The default string implementation in Haskell is a list of Char. This is inefficient for file input-output, according to Real World Haskell, since each character is separately allocated (I assume that this means that a String is basically a linked list in Haskell, but I'm not sure.)
But if the default string implementation is inefficient for file i/o, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a String, and I assumed that this would be the default way of doing things in most languages.
As I see it, the list implementation of String will take up more memory, since each character will require overhead, and also more time to iterate over, because a pointer dereferencing will be required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.
Apart from String/ByteString there is now the Text library which combines the best of both worlds—it works with Unicode while being ByteString-based internally, so you get fast, correct strings.
Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.
http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/
As far as the efficiency of the default string implementation goes in Haskell, it's not. Each Char represents a Unicode codepoint which means it needs at least 21bits per Char.
Since a String is just [Char], that is a linked list of Char, it means Strings have poor locality of reference, and again means that Strings are fairly large in memory, at a minimum it's N * (21bits + Mbits) where N is the length of the string and M is the size of a pointer (32, 64, what have you) and unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be able to be optimized to loops, etc. by the compiler.
And while a Char corresponds to a codepoint, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO, not even a default much less a way to change it. In practice GHC provides an extensions to do e.g. binary IO, but you're going off the reservation at that point anyway.
Even with operations like prepending to front of the string it's unlikely that a String will beat a ByteString in practice.
The answer is a bit more complex than just "use lazy bytestrings".
Byte strings only store 8 bits per value, whereas String holds real Unicode characters. So if you want to work with Unicode then you have to convert to and from UTF-8 or UTF-16 all the time, which is more expensive than just using strings. Don't make the mistake of assuming that your program will only need ASCII. Unless its just throwaway code then one day someone will need to put in a Euro symbol (U+20AC) or accented characters, and your nice fast bytestring implementation will be irretrievably broken.
Byte strings make some things, like prepending to the start of a string, more expensive.
That said, if you need performance and you can represent your data purely in bytestrings, then do so.
The basic answer given, use ByteString, is correct. That said, all of the three answers before mine have inaccuracies.
Regarding UTF-8: whether this will be an issue or not depends entirely on what sort of processing you do with your strings. If you're simply treating them as single chunks of data (which includes operations such as concatenation, though not splitting), or doing certain limited byte-based operations (e.g., finding the length of the string in bytes, rather than the length in characters), you won't have any issues. If you are using I18N, there are enough other issues that simply using String rather than ByteString will start to fix only a very few of the problems you'll encounter.
Prepending single bytes to the front of a ByteString is probably more expensive than doing the same for a String. However, if you're doing a lot of this, it's probably possible to find ways of dealing with your particular problem that are cheaper.
But the end result would be, for the poster of the original question: yes, Strings are inefficient in Haskell, though rather handy. If you're worried about efficiency, use ByteStrings, and view them as either arrays of Char8 or Word8, depending on your purpose (ASCII/ISO-8859-1 vs Unicode of some sort, or just arbitrary binary data). Generally, use Lazy ByteStrings (where prepending to the start of a string is actually a very fast operation) unless you know why you want non-lazy ones (which is usually wrapped up in an appreciation of the performance aspects of lazy evaluation).
For what it's worth, I am building an automated trading system entirely in Haskell, and one of the things we need to do is very quickly parse a market data feed we receive over a network connection. I can handle reading and parsing 300 messages per second with a negligable amount of CPU; as far as handling this data goes, GHC-compiled Haskell performs close enough to C that it's nowhere near entering my list of notable issues.

Resources