Note that this question is the same as this previously unanswered question.
It is also the same as this PHP question, but I'm looking for the Haskell equivalent.
RFC 2047 defines the standard for "encoded-word" encodings and provides an example of:
=?iso-8859-1?q?this=20is=20some=20text?=
Is there a standard Haskell library for decoding this into its correct Text representation?
It shouldn't be too hard to write a custom parser using parsec and the RFC spec, but this seems like a common, solved problem in other languages for which I cannot find a Haskell equivalent, and I'd rather not reinvent the wheel here.
In the mime package, have a look at decodeWord in the module Codec.MIME.Decode:
ghci> import Codec.MIME.Decode
ghci> decodeWord "=?iso-8859-1?q?this=20is=20some=20text?="
Just ("this is some text","")
From reading the source code both iso-8859-1 and us-ascii are supported.
There is also decodeWords, which uses the decodeWord function to translate an entire String:
ghci> decodeWords "Foo=?iso-8859-1?q?this=20is=20some=20text?=Bar"
"Foothis is some textBar"
Related
Do you know the fastest way to encode and decode UTF-8, given some extra information about the input? Here are the interesting cases that occur to me:
Serialization
I just want to encode an opaque buffer with no validation so I can decode it again later. The fastest approach would be to use the underlying memory buffer and somehow unsafely coerce it from Text to ByteString without touching the contents.
Probably ASCII
I guess that 99% of the time my UTF-8 is actually ASCII, so it makes sense to do a first pass to confirm this and only do further processing if it turns out not to be true (see the sketch after these cases).
Probably not ASCII
Converse of the previous.
Probably short
A single key in JSON or a database that I guess will be 1 to 20 characters. It would be silly to pay some upfront cost like a vectorized SIMD approach.
Probably long
An HTML document. It's worth paying some upfront cost for the highest throughput.
There are some more similar variants, like encoding JSON or a URL when you think there are probably no characters that need escaping.
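For the "probably ASCII" case, here is a minimal sketch of the fast path described above (decodeMostlyAscii is a made-up helper name; this assumes the bytestring and text packages):
import qualified Data.ByteString as B
import Data.Text (Text)
import Data.Text.Encoding (decodeLatin1, decodeUtf8')
import Data.Text.Encoding.Error (UnicodeException)
-- A pure-ASCII ByteString is already valid UTF-8 (and valid Latin-1),
-- so decodeLatin1 can copy it into a Text without any validation work.
decodeMostlyAscii :: B.ByteString -> Either UnicodeException Text
decodeMostlyAscii bs
  | B.all (< 0x80) bs = Right (decodeLatin1 bs)
  | otherwise         = decodeUtf8' bs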
I'm asking this question under the [Haskell] tag since Haskell's strong typing makes some techniques that would be easy in, say, C, hard to implement. Also, there may be some special GHC tricks, like using SSE4 instructions on an Intel platform, that would be interesting. But this is more of a UTF-8 issue in general, and good ideas would be helpful in any language.
Update
After some research I propose to implement encode and decode for serialization purposes like so:
myEncode :: Text -> ByteString
myEncode = unsafeCoerce
myDecode :: ByteString -> Text
myDecode = unsafeCoerce
This is a great idea if you enjoy segfaults...
This question implicates a sprawling range of issues. I'm going to interpret it as "In Haskell, how should I convert between Unicode and other character encodings?"
In Haskell, the recommended way to convert to and from Unicode is with the functions in text-icu, which provides some basic functions:
fromUnicode :: Converter -> Text -> ByteString
toUnicode :: Converter -> ByteString -> Text
text-icu is a binding to the International Components for Unicode (ICU) libraries, which do the heavy lifting for, among other things, encoding and decoding to non-Unicode character sets. Its website gives documentation on conversion in general and some specific information on how its converter implementations operate. Note that different character sets require somewhat different converter implementations.
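For concreteness, a minimal sketch of a round trip, assuming the converter API in Data.Text.ICU.Convert (the file names and the "ISO-8859-1" converter name are placeholders; ICU accepts many aliases for converter names):
import qualified Data.ByteString as B
import Data.Text (Text)
import Data.Text.ICU.Convert (open, toUnicode, fromUnicode)
main :: IO ()
main = do
  conv  <- open "ISO-8859-1" Nothing      -- the Maybe Bool controls fallback mappings
  bytes <- B.readFile "input.txt"         -- placeholder input file
  let txt = toUnicode conv bytes :: Text  -- ISO-8859-1 bytes -> Unicode Text
  B.writeFile "output.txt" (fromUnicode conv txt)  -- Text -> ISO-8859-1 bytes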
ICU can also attempt to automatically detect the character set of an input. "This is, at best, an imprecise operation using statistics and heuristics." No other implementation could "fix" that characteristic. The Haskell bindings do not expose that functionality as I write; see #8.
I don't know of any character set conversion procedures written in native Haskell. As the ICU documentation indicates, there is a lot of complexity; after all, this is a rich area of international computing history.
Performance
As the ICU FAQ laconically notes, "Most of the time, the memory throughput of the hard drive and RAM is the main performance constraint." Although that comment is not specifically about conversions, I'd expect it to be broadly the case here as well. Is your experience otherwise?
unsafeCoerce is not appropriate here.
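For the serialization use case in the question, a minimal sketch of the safe alternative using the text package (encodeUtf8 is total, and decodeUtf8' reports malformed input instead of crashing):
import Data.ByteString (ByteString)
import Data.Text (Text)
import Data.Text.Encoding (encodeUtf8, decodeUtf8')
import Data.Text.Encoding.Error (UnicodeException)
myEncode :: Text -> ByteString
myEncode = encodeUtf8
-- Left means the bytes were not well-formed UTF-8.
myDecode :: ByteString -> Either UnicodeException Text
myDecode = decodeUtf8'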
Are there any tools available for producing a parse tree of Node.js code? The closest I can find is the Closure Compiler, but I don't think that will give me a parseable tree for analysis.
Node.js is just JavaScript (aka ECMAScript). I recommend Esprima (http://esprima.org/), which supports the de-facto standard AST established by Mozilla.
Esprima generates very nice ASTs, but if you need the parse tree you must look elsewhere. Esprima only returns ASTs and the sequence of tokens for some text. If you don't want to model the language yourself, you could use another tool like ANTLR (see: https://stackoverflow.com/a/5982455/206543).
For the difference between ASTs and parse trees, look here: https://stackoverflow.com/a/9864571/206543
Haddock seems to incorrectly re-encode non-ASCII characters in the documentation in UTF-8 encoded source files. I often need to include mathematical formulas in the documentation and they are much more readable if some common math symbols such as summation (∑) can be used.
However, after running the files through haddock, these symbols become blank squares.
Haddock has the option --use-unicode, but that just converts function arrows in function signatures etc. into Unicode characters, while still breaking the actual documentation.
Even better would be if this could be controlled from cabal haddock!
I'm using Haddock version 2.9.4.
Note that Haddock uses the GHC API to do parsing. Non-ASCII characters in comments are not handled properly by GHC < 7.4, but it seems that with GHC 7.4 it works fine.
If UTF-8 cannot be used and numeric character references like &#8721; or &#x2211; (these are correct references for the n-ary summation symbol ∑) are regarded as unreadable, then I'm afraid the only option is to use named references like &sum;, if they get passed through to the HTML result and are supported by the browser(s) that will be used.
That’s a big “if,” since the new HTML5 entities have rather limited support, but perhaps in an intranet where everyone uses Firefox... HTML5 entities:
http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html
(And most of the references are not as mnemonic as &sum;.)
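If raw UTF-8 keeps getting mangled, one possible workaround, assuming Haddock accepts numeric character references inside documentation comments, is to spell the symbol as a reference; total here is just a made-up example function:
-- | Returns the n-ary sum &#x2211; of the inputs
--   (&#x2211; is the numeric character reference for the summation sign).
total :: Num a => [a] -> a
total = sum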
The Haskell 2010 Language Report says:
Haskell uses the Unicode [2] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.
Does this mean UTF-8?
In ghc-7.0.4/compiler/parser/Lexer.x.source:
$unispace = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
$whitechar = [\ \n\r\f\v $unispace]
$white_no_nl = $whitechar # \n
$tab = \t
$ascdigit = 0-9
$unidigit = \x03 -- Trick Alex into handling Unicode. See alexGetChar.
$decdigit = $ascdigit -- for now, should really be $digit (ToDo)
$digit = [$ascdigit $unidigit]
$special = [\(\)\,\;\[\]\`\{\}]
$ascsymbol = [\!\#\$\%\&\*\+\.\/\<\=\>\?\@\\\^\|\-\~]
$unisymbol = \x04 -- Trick Alex into handling Unicode. See alexGetChar.
$symbol = [$ascsymbol $unisymbol] # [$special \_\:\"\']
$unilarge = \x01 -- Trick Alex into handling Unicode. See alexGetChar.
$asclarge = [A-Z]
$large = [$asclarge $unilarge]
$unismall = \x02 -- Trick Alex into handling Unicode. See alexGetChar.
$ascsmall = [a-z]
$small = [$ascsmall $unismall \_]
$unigraphic = \x06 -- Trick Alex into handling Unicode. See alexGetChar.
$graphic = [$small $large $symbol $digit $special $unigraphic \:\"\']
...I'm not sure what to make of this. alexGetChar wasn't really helpful.
There was a proposal to standardize on UTF-8 as the standard encoding of Haskell source files, but I'm not sure if it was accepted or not.
In practice, GHC assumes all input files are UTF-8, but it ignores malformed byte sequences in comments.
Unicode is a character set. UTF-8, UTF-16, etc. are concrete physical encodings of Unicode code points. Try reading here; the difference is explained pretty well there.
The cited part of the report just states that Haskell sources use the Unicode character set. It doesn't state which encoding should be used at all. In other words, it says which characters can appear in the sources, but not how they are written in terms of plain bytes.
While the Haskell standard simply says Unicode is the set of possible characters (as opposed to e.g. ASCII or Latin-1), it doesn't specify which of the several different encodings (UTF-8, UTF-16, UTF-32, byte order) to use.
Alex, the lexer that comes with the Haskell Platform, requires its input to be UTF-8 encoded*, which is why you see the code you mention. In practice, I think all the major implementations of Haskell require source to be in UTF-8.
* This is actually a real problem, as GHC stores strings, and more importantly Data.Text, internally as UTF-16. It would be nice to be able to lex these directly rather than converting back and forth.
There is an important distinction between the data type (i.e. what “abstract” data you can work with) and its representation (i.e. how it is stored in the computer memory or on disk).
The Haskell Report says two things related to Unicode:
That the Char data type in Haskell represents a Unicode character (also known as a code point). You should think of it as an abstract data type that provides a certain interface (e.g. you can call isDigit or toLower on it), but you are not allowed to know how exactly it is represented internally. The specific implementation of Haskell (e.g. GHC) is free to represent it in memory in whatever way it wants, and it doesn't matter at all, as you can't access the underlying raw bits anyway.
That a Haskell program is text, consisting of (abstract) Unicode code points, that is, essentially, a String. And then it goes on to explain how to parse this String. Once again, it is important to stress that it defines the syntax of Haskell in terms of sequences of abstract Unicode code points.
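For instance, in a ghci session (output shown exactly as GHC prints it, with non-ASCII characters escaped as decimal code points):
ghci> import Data.Char
ghci> ord '∑'      -- the code point of the n-ary summation sign
8721
ghci> toLower 'Σ'  -- operates on the abstract character, not on bytes
'\963'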
Now, to your question about Haskell source code. The Haskell Report does not specify how this Unicode text is encoded into zeroes and ones when stored in a file.
In fact, the Haskell Report does not specify how Haskell programs are stored at all! It doesn’t mention that Haskell source code is stored in files, that files have to be named after modules, and that the directory structure should follow the structure of module names – these all are considered to be compiler implementation details, and the idea is that this allows each compiler to store Haskell programs wherever and however they want: in files, in database tables, as jpeg photos of a blackboard with a program written on it with chalk. For this reason it does not specify the encoding either (it would make no sense to specify the encoding for a program written out on a blackboard 😕).
However, GHC, the de-facto standard Haskell compiler, assumes that Haskell programs are stored in files encoded as UTF-8, organised hierarchically, and named after module names.
What's the best way to determine the native newline characters such as '\n' or '\r\n' in Haskell?
I see there is a "nativeNewline" function in GHC.IO.Handle, but I assume that it is both a private API and, most of all, non-standard Haskell.
You should think of the newline representation as part of the encoding of a text file that is stored in the filesystem, just like UTF-8. A text file is normally decoded when you read it into your program, and encoded when written -- converting to and from the native newline representation is done as part of this encoding and decoding. Inside your Haskell program, just as characters are represented by their Unicode code points, the newline character is always \n.
To tell the I/O system about the newline encoding you want to use, see the section on Newline Conversion in the documentation for System.IO.
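For example, a minimal sketch using the newline-mode machinery in System.IO (the file name is a placeholder): universalNewlineMode accepts both \n and \r\n on input, while nativeNewlineMode follows the platform convention for both reading and writing.
import System.IO
main :: IO ()
main = do
  h <- openFile "example.txt" ReadMode
  hSetNewlineMode h universalNewlineMode  -- accept both \n and \r\n on input
  contents <- hGetContents h
  putStr contents                         -- inside the program, lines end in \n
  hClose h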
System.IO.nativeNewline is not private - you can access it to find out what GHC considers the native "newline" to be on the current platform.
Note that the type of this variable, System.IO.Newline, does not have a Show instance as of GHC 6.12.3, so you can't easily print its value. Instead, check whether it is equal to System.IO.LF or System.IO.CRLF.
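Since there is no Show instance in that version, the simplest approach is to pattern match on it; a small sketch (describeNewline is a made-up helper name):
import System.IO (Newline (..), nativeNewline)
describeNewline :: Newline -> String
describeNewline LF   = "\\n (Unix-style)"
describeNewline CRLF = "\\r\\n (Windows-style)"
main :: IO ()
main = putStrLn ("Native newline: " ++ describeNewline nativeNewline)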
However, as Simon pointed out, you shouldn't need to know about the native newline sequence with normal usage of the text-oriented IO functions in GHC.
This variable, together with the rest of the new Unicode-aware capabilities of the IO system, is not yet part of the Haskell standard. It was not included in the Haskell 2010 report. However, since it is already implemented in GHC, and there is quite a widespread consensus that it is important and useful, expect it to be included in one of the upcoming yearly revisions of the standard.