Haddock seems to incorrectly re-encode non-ASCII characters in the documentation in UTF-8 encoded source files. I often need to include mathematical formulas in the documentation and they are much more readable if some common math symbols such as summation (∑) can be used.
However, after running the files through haddock, these symbols become blank squares.
Haddock has the option --use-unicode, but that just converts function arrows in function signatures etc. into Unicode characters, while still breaking the actual documentation.
Even better would be if this can be controlled from cabal haddock!
I'm using Haddock version 2.9.4.
Note that Haddock uses the GHC API to do parsing. Non-ASCII characters in comments are not handled properly by GHC < 7.4, but it seems that with GHC 7.4 it works fine.
If UTF-8 cannot be used and numeric character references like &#8721; or &#x2211; (these are correct references for the n-ary summation symbol ∑) are regarded as unreadable, then I’m afraid the only option is to use named references like &sum;, if they get passed through to the HTML result and are supported by the browser(s) that will be used.
That’s a big “if,” since the new HTML5 entities have rather limited support, but perhaps in an intranet where everyone uses Firefox... HTML5 entities:
http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html
(And most of the references are not as mnemonic as &sum;.)
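If numeric references end up being the practical route, writing them by hand gets tedious. Here is a minimal Haskell sketch that turns every non-ASCII character into a decimal numeric character reference. The name `escapeNonAscii` is hypothetical, and whether the references survive into the generated HTML depends on the documentation tool in use.

```haskell
import Data.Char (isAscii, ord)

-- | Replace every non-ASCII character with a decimal numeric character
-- reference such as &#8721;, leaving ASCII characters untouched.
-- (Hypothetical helper, not part of Haddock or any library.)
escapeNonAscii :: String -> String
escapeNonAscii = concatMap escape
  where
    escape c
      | isAscii c = [c]
      | otherwise = "&#" ++ show (ord c) ++ ";"
```

For example, `escapeNonAscii "∑ x"` gives `&#8721; x`, since ∑ is U+2211, which is 8721 in decimal.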
Related
I have some ASCII-encoded files containing ASCII representations of individual Unicode characters, such as ..., --, and so on, that I'd like to convert to e.g. Unicode ellipsis and en-dash symbols for display purposes. This could be as simple as a replace filter over all such mappings (in the right order, to catch things like --- -> — and -- -> –, of course). (Note: there are more than just those.)
Does there exist a database of all such conversions somewhere? I assume the inverse must exist somehow to be able to gracefully convert unicode to plaintext whenever possible, e.g. … -> ....
It doesn't have to be extremely accurate or anything, as long as the conversion is appropriate in most cases and makes sense. The output will just be displayed to the user and won't be further processed. I could just compile a list myself as I go, but it would be nice to save time and avoid duplicating effort if it has already been done.
Thanks!
A comprehensive list isn't a very good idea as there are a lot of Unicode characters that exist for compatibility, or are poorly supported (see my comment). Instead, you probably want to use a curated list/library like SmartyPants (ports/alternatives can be found for most other languages).
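For a small hand-curated list, the ordered replace filter from the question can be sketched in Haskell as below. The three mappings are only illustrative, `replacements` and `prettify` are hypothetical names, and this is not how SmartyPants itself is implemented.

```haskell
import Data.List (isPrefixOf)

-- | Illustrative mapping only. Longer patterns come first so that
-- "---" is tried before "--".
replacements :: [(String, String)]
replacements =
  [ ("---", "\8212")  -- U+2014, em dash
  , ("--",  "\8211")  -- U+2013, en dash
  , ("...", "\8230")  -- U+2026, horizontal ellipsis
  ]

-- | Scan the input left to right, applying the first pattern that matches
-- at the current position, or copying one character if none matches.
prettify :: String -> String
prettify [] = []
prettify s@(c:cs) =
  case [ (to, drop (length from) s)
       | (from, to) <- replacements
       , from `isPrefixOf` s ] of
    ((to, rest):_) -> to ++ prettify rest
    []             -> c : prettify cs
```

Because matching happens at each position in list order, the ordering of `replacements` is what keeps `---` from being read as `--` followed by `-`.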
I am writing an Erlang module that has to deal a bit with strings. Not too much, but I do some TCP recv and then some parsing over the data.
While matching data and manipulating strings, I am using the binary module all the time, like binary:split(Data,<<":">>), and basically using <<"StringLiteral">> all the time.
So far I have not encountered difficulties or missing functions compared to the alternative (using lists), and everything is coming out quite naturally, except maybe for adding the <<>>, but I was wondering if this way of dealing with strings might have drawbacks I am not aware of.
Any hint?
As long as you and your team remember that your strings are binaries and not lists, there are no inherent problems with this approach. In fact, CouchDB took this approach as an optimization which apparently paid nice dividends.
You do need to be very aware of how your string is encoded in your binaries. When you do <<"StringLiteral">> in your code, you have to be aware that this is simply a binary serialization of the list of code points. Your Erlang compiler reads your code as ISO-8859-1 characters, so as long as you only use Latin-1 characters and do this consistently, you should be fine. But this isn't very friendly to internationalization.
Most application software these days should prefer a Unicode encoding. UTF-8 is compatible with your <<"StringLiteral">> for the first 128 code points, but not for the second 128, so be careful. You might be surprised what you see on your UTF-8 encoded web applications if you use <<"StrïngLïteral">> in your code.
There was an EEP proposal for binary support in the form of <<"StrïngLïteral"/utf8>>, but I don't think this is finalized.
Also be aware that your binary:split/2 function may have unexpected results in UTF-8 if there is a multi-byte character that contains the ISO-8859-1 byte that you are splitting on.
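To make the byte-level difference concrete, here is a small sketch written in Haskell rather than Erlang, with hand-rolled encoders so it needs nothing beyond base. The function names are hypothetical, and the UTF-8 encoder does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Latin-1 (ISO-8859-1): one byte per character, defined only up to U+00FF.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = traverse enc
  where
    enc c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing

-- | UTF-8 encoding of a single code point (1 to 4 bytes).
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi k   = fromIntegral (n `shiftR` k)                     -- leading payload bits
    cont k = 0x80 .|. fromIntegral ((n `shiftR` k) .&. 0x3F) -- continuation byte

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeUtf8Char
```

For "ï" (U+00EF), `encodeLatin1` yields the single byte 0xEF, while `encodeUtf8` yields the two bytes 0xC3 0xAF. Every byte of a multi-byte UTF-8 sequence is 0x80 or above, so splitting on an ASCII byte such as ":" cannot cut a character in half; splitting on a byte like 0xAF can.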
Some would argue that UTF-16 is a better encoding to use because it can be parsed more efficiently and can be more easily split by index, if you assume or verify that there are no characters that need 32 bits (surrogate pairs).
The unicode module should be used, but tread carefully when you use literals.
The only thing to be aware of is that a binary is a slice of bytes, whereas a list is a list of unicode codepoints. In other words, the latter is naturally unicode whereas the former requires you to do some sort of encoding, usually UTF-8.
To my knowledge, there are no drawbacks to your method.
Binaries are very efficient structures for storing strings. If they are longer than 64 bytes, they are also stored outside the process heap, so they are not subject to the per-process GC (they are still reclaimed by reference counting when the last reference is lost). Don't forget to use iolists when concatenating them, to avoid copying when performance matters.
The Haskell 2010 Language Report says:
Haskell uses the Unicode [2] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.
Does this mean UTF-8?
In ghc-7.0.4/compiler/parser/Lexer.x.source:
$unispace = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
$whitechar = [\ \n\r\f\v $unispace]
$white_no_nl = $whitechar # \n
$tab = \t
$ascdigit = 0-9
$unidigit = \x03 -- Trick Alex into handling Unicode. See alexGetChar.
$decdigit = $ascdigit -- for now, should really be $digit (ToDo)
$digit = [$ascdigit $unidigit]
$special = [\(\)\,\;\[\]\`\{\}]
$ascsymbol = [\!\#\$\%\&\*\+\.\/\<\=\>\?\@\\\^\|\-\~]
$unisymbol = \x04 -- Trick Alex into handling Unicode. See alexGetChar.
$symbol = [$ascsymbol $unisymbol] # [$special \_\:\"\']
$unilarge = \x01 -- Trick Alex into handling Unicode. See alexGetChar.
$asclarge = [A-Z]
$large = [$asclarge $unilarge]
$unismall = \x02 -- Trick Alex into handling Unicode. See alexGetChar.
$ascsmall = [a-z]
$small = [$ascsmall $unismall \_]
$unigraphic = \x06 -- Trick Alex into handling Unicode. See alexGetChar.
$graphic = [$small $large $symbol $digit $special $unigraphic \:\"\']
...I'm not sure what to make of this. alexGetChar wasn't really helpful.
There was a proposal to make UTF-8 the standard encoding of Haskell source files, but I'm not sure if it was accepted or not.
In practice, GHC assumes all input files are UTF-8, but it ignores malformed byte sequences in comments.
Unicode is a character set. UTF-8, UTF-16, etc. are the concrete physical encodings of Unicode code points. Try reading here; the difference is explained pretty well there.
The cited part of the report just states that Haskell sources use the Unicode character set. It doesn't state which encoding should be used at all. In other words, it says which characters can appear in the sources, but doesn't say how they are written in terms of plain bytes.
While the Haskell standard simply says that Unicode is the set of possible characters (as opposed to e.g. ASCII or Latin-1), it doesn't specify which of the several different encodings (UTF-8, UTF-16, UTF-32, byte order) to use.
Alex, the lexer that comes with the Haskell Platform, requires its input to be UTF-8 encoded *, which is why you see the code you mention. In practice I think all the major implementations of Haskell require source to be in UTF-8.
* - This is actually a real problem as GHC stores strings and more importantly Data.Text internally as UTF-16. It would be nice to be able to lex these directly rather than converting back and forth.
There is an important distinction between the data type (i.e. what “abstract” data you can work with) and its representation (i.e. how it is stored in the computer memory or on disk).
The Haskell Report says two things related to Unicode:
That the Char data type in Haskell represents a Unicode character (also known as a code point). You should think of it as an abstract data type that provides a certain interface (e.g. you can call isDigit or toLower on it), but you are not allowed to know how exactly it is represented internally. The specific implementation of Haskell (e.g. GHC) is free to represent it in memory in whatever way it wants and it doesn’t matter at all, as you can’t access the underlying raw bits anyway.
That a Haskell program is text, consisting of (abstract) Unicode code points, that is, essentially, a String. And then it goes on to explain how to parse this String. Once again, it is important to stress that it defines the syntax of Haskell in terms of sequences of abstract Unicode code points.
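A tiny sketch of what "abstract code points" means in practice: the functions below work on Char values and never see bytes, so the result is the same whatever encoding the source file or input used. `codePoints` and `lowerAll` are hypothetical helper names.

```haskell
import Data.Char (ord, toLower)

-- | The numeric code point of each character in a String.
codePoints :: String -> [Int]
codePoints = map ord

-- | Lower-case a String one code point at a time.
lowerAll :: String -> String
lowerAll = map toLower
```

For instance, `codePoints "∑"` is `[8721]` and `length "∑"` is 1, even though the same character takes three bytes in UTF-8.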
Now, to your question about Haskell source code. The Haskell Report does not specify how this Unicode text is encoded into zeroes and ones when stored in a file.
In fact, the Haskell Report does not specify how Haskell programs are stored at all! It doesn’t mention that Haskell source code is stored in files, that files have to be named after modules, and that the directory structure should follow the structure of module names – these all are considered to be compiler implementation details, and the idea is that this allows each compiler to store Haskell programs wherever and however they want: in files, in database tables, as jpeg photos of a blackboard with a program written on it with chalk. For this reason it does not specify the encoding either (it would make no sense to specify the encoding for a program written out on a blackboard 😕).
However, GHC, the de-facto standard Haskell compiler, assumes that Haskell programs are stored in files encoded as UTF-8, organised hierarchically, and named after module names.
I wanted to write some educational code in Haskell with Unicode characters (non-Latin) in the identifiers. (So that the identifiers look nice and natural for speakers of a natural language other than English which is not using the Latin characters in its writing.) So, I set out for finding an appropriate Haskell implementation that would allow this.
But where is this feature specified in the language specification? How would I refer to this feature when looking for a conforming implementation? (And which Haskell implementations are known to actually support Unicode identifiers?)
It turned out that one Haskell implementation did accept my code with Unicode identifiers, whereas another one failed to accept it. I would like it if there were a way to formalize this requirement of my code, in the form of a language feature switch perhaps, so that if I or someone else tries to run my code, it would be immediately clear whether his implementation is missing the required feature and hence he should look for another one. (There could also be a wiki page for this feature, "Unicode identifiers", which would list which of the existing implementations support it, so that one would know where to go if one needs it.)
(BTW, I have put a "syntax" tag on this question, but I actually perceive it to be an issue of the level of lexing, a lower level than the syntax of a language. Is there a tag here for features of the lexing level of a language, rather than for features of the syntax specification of a language?)
The Online Report documents this under Lexemes. It also notes early on that "Haskell uses the Unicode character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.".
Actual compilers may or may not support Unicode identifiers. GHC does, but you need to keep in mind that Unicode codepoints must obey the same rules as ASCII characters: types must start with a codepoint which is classed as uppercase or titlecase, variables as lowercase (although de facto this is relaxed to alphabetic and not uppercase/titlecase; this might be worth asking for a clarification from the language committee), operators must be punctuation or symbol. (This means that you can't declare types in Arabic, for example, unless you prefix them with a character in some other script that is uppercase/titlecase.)
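The first-character rules above can be approximated with the predicates in Data.Char. This is only a rough sketch: `StartKind` and `classifyStart` are hypothetical names, and this is not GHC's actual lexer code.

```haskell
import Data.Char (isAlpha, isLower, isUpper)

data StartKind = TypeLike | VariableLike | Neither
  deriving (Eq, Show)

-- | Classify an identifier by its first character.
classifyStart :: Char -> StartKind
classifyStart c
  | isUpper c             = TypeLike      -- uppercase or titlecase letter
  | isLower c || c == '_' = VariableLike  -- lowercase letter or underscore
  | isAlpha c             = VariableLike  -- caseless letter (the de facto relaxation)
  | otherwise             = Neither       -- digits, symbols, punctuation
```

With this, a Cyrillic capital such as 'Д' is `TypeLike`, while an Arabic letter such as 'ب' has no case and comes out as `VariableLike`, which is why it cannot start a type name.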
As to collecting Unicode support information: while I don't know of a single page that provides it, searching for "unicode" on the Haskell Wiki finds information about Unicode support in a number of Haskell compilers.
What's the best way to determine the native newline characters such as '\n' or '\r\n' in Haskell?
I see there is a "nativeNewline" function in GHC.IO.Handle, but assume that it is both a private API and most of all non-standard Haskell.
You should think of the newline representation as part of the encoding of a text file that is stored in the filesystem, just like UTF-8. A text file is normally decoded when you read it into your program, and encoded when written -- converting to and from the native newline representation is done as part of this encoding and decoding. Inside your Haskell program, just as characters are represented by their Unicode code points, the newline character is always \n.
To tell the I/O system about the newline encoding you want to use, see the section on Newline Conversion in the documentation for System.IO.
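As a minimal sketch of what that looks like, the helper below opens a file so that both "\n" and "\r\n" line endings are read as "\n". `openUniversal` is a hypothetical name; `hSetNewlineMode` and `universalNewlineMode` come from System.IO.

```haskell
import System.IO

-- | Open a text file for reading, translating "\r\n" to "\n" on input
-- regardless of the platform's native convention.
openUniversal :: FilePath -> IO Handle
openUniversal path = do
  h <- openFile path ReadMode
  hSetNewlineMode h universalNewlineMode
  pure h
```

Anything read from the returned handle then only ever contains \n, so functions like `lines` behave the same on every platform.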
System.IO.nativeNewline is not private - you can access it to find out what GHC considers the native "newline" to be on the current platform.
Note that the type of this variable, System.IO.Newline, does not have a Show instance as of GHC 6.12.3. So you can't easily print its value. Instead, check to see if it is equal to System.IO.LF or System.IO.CRLF.
However, as Simon pointed out, you shouldn't need to know about the native newline sequence with normal usage of the text-oriented IO functions in GHC.
This variable, together with the rest of the new Unicode-aware capabilities of the IO system, is not yet part of the Haskell standard. It was not included in the Haskell 2010 report. However, since it is already implemented in GHC, and there is quite a widespread consensus that it is important and useful, expect it to be included in one of the upcoming yearly revisions of the standard.
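If you do want the native sequence as a String, a short sketch is to pattern match on nativeNewline, which sidesteps the need for a Show instance. `nativeNewlineString` is a hypothetical name; `Newline`, `LF`, `CRLF` and `nativeNewline` are exported by System.IO in GHC.

```haskell
import System.IO (Newline (..), nativeNewline)

-- | The platform's newline sequence as a String, derived from nativeNewline.
nativeNewlineString :: String
nativeNewlineString = case nativeNewline of
  LF   -> "\n"
  CRLF -> "\r\n"
```

This is GHC-specific, so code that must run on other implementations would need a fallback.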