Character encoding issue in GHC - haskell

When I try and read a plaintext file from within my Haskell program I get:
[fromList * Exception: /path/to/file/aaa.txt hGetContents: invalid argument (Invalid or incomplete multibyte or wide character)
I googled to find this problem is usually set right by setting LANG to en_US.UTF-8
That's already how my locale looks.
Not sure if this is an issue with GHC at all.
I am on Ubuntu 11.10

Are you sure aaa.txt contains valid UTF-8? If it's binary data, you need to use withBinaryFile or similar. If it is text in another encoding, you should use hSetEncoding.
For instance, if your text is in Latin-1 then you would say
hSetEncoding h latin1
where "h" is your file handle. If you are reading from standard input then its
hSetEncoding stdin latin1
There is also a mkTextEncoding function which you can use if you have read the encoding from metadata, or wish to customise the handling of invalid Unicode (although this only works on some systems).
The Unicode standards say that a Unicode parser should reject invalid strings with an error, rather than trying to fix them up. This is a deliberate rejection of Postel's Law, on the grounds of reducing security holes and inconsistent interpretations.
(You might want to consider using the text library if you'll be working with a lot of text and having to handle encoding issues; it's usually a lot faster than using Strings, since it uses an unboxed array rather than a linked list, although this means that Text values and operations on them are necessarily strict. It also lets you configure how to respond to invalid Unicode more portably and flexibly.)

Related

why does this string not contain the correct characters?

Written in Delphi XE3, my software is communicating with an instrument that occasionally sends binary data. had expected I should use AnsiString since this data will never be Unicode. I couldn't believe that the following code doesn't work as I had expected. I'm supposing that the characters I'm exposing it to are considered illegitimate...
var
s:AnsiString;
begin
s:='test' + chr(128);
// had expected that since the string was set above to end in #128,
// it should end in #128...it does not.
if ord(s[5])<>128 then
ShowMessage('String ending is not as expected!');
end;
Naturally, I could use a pointer to accomplish this but I would think I should probably be using a different kind of string. of course, I could use a byte array but a string would be more convenient.
really, I'd like to know "why" and have some good alternatives.
thanks!
The behaviour you observe stems from the fact that Chr(128) is a UTF-16 WideChar representing U+0080.
When translated to your ANSI locale this does not map to ordinal 128. I would expect U+0080 to have no equivalent in your ANSI locale and therefore map to ? to indicate a failed translation.
Indeed the compiler will even warn you that this can happen. You code when compiled with default compiler options yields these warnings:
W1058 Implicit string cast with potential data loss from 'string' to 'AnsiString'
W1062 Narrowing given wide string constant lost information
Personally I would use configure the warnings to treat both of those warnings as errors.
The fundamental issue is revealed here:
My software is communicating with an instrument that occasionally sends binary data.
The correct data type for byte oriented binary data is an array of byte. In Delphi that would be TBytes.
It is wrong to use AnsiString since that exposes you to codepage translations. You want to be able to specify ordinal values and you categorically do not want text encodings to play a part. You do not want for your program's behaviour to be determined by the prevailing ANSI locale.
Strings are for text. For binary use byte arrays.

linux libiconv transcode from ISO8859 or IBM850 to UTF8 error

I don't know what the original code is, so I assume that the original code is IBM850 or ISO8859-1.My process below
IBM850 -> UTF8
if this is OK, I consider the original code is IBM850, if NOK,do next step:
ISO8859-1 -> UTF8
if this is OK, I consider the original code is UTF8.
But there is a problem,
if the original code is ISO8859-1, it will be recognised to IBM850.
if the original code is IBM850, it will be recognised to ISO8859-1.
It seems that there are common ground between IBM850 and ISO8859-1.
Who can help me, thanks.
Yes, only the most trivial kind of autodetection is possible by testing whether conversion fails or succeeds. It's not going to work for input encodings where (almost) any input is valid.
You should know something more about your likely output, to test whether if it makes more sense after translating from IBM850 or from ISO8859-1. That's what enca and libenca do. You can probably start with some simple expectations to check:
Does your source happen to be within the ASCII subset of both encodings? Then you're happy with any conversion (but you have no way to know the original encoding at all).
Does your code use box drawing characters? If it does not, it would be easy to reject some candidates for IBM850.
Does your code use control characters from ISO8859-1? If it does not, it would be easy to reject some candidates for ISO8859-1 if codepoints 0x80-0x9F are used.
Do the fragments of your code which are non-ASCII always represent a text in a natural language? Then you can use frequency tables for characters and their pairs, selecting the source encoding which makes the result closer to your natural language(s) on these criteria. (If both variants are almost equally acceptable, it's probably better to give an error message and leave the final decision to humans).

Erlang binary strings by default

I am writing an erlang module that has to deal a bit with strings, not too much, however, I do some tcp recv and then some parsing over the data.
While matching data and manipulating strings, I am using binary module all the time like binary:split(Data,<<":">>) and basically using <<"StringLiteral">> all the time.
Till now I have not encounter difficulties or missing methods from the alternative( using lists) and everything is coming out quite naturally except maybe for adding the <<>>, but I was wondering if this way of dealing with strings might have drawbacks I am not aware of.
Any hint?
As long as you and your team remember that your strings are binaries and not lists, there are no inherent problems with this approach. In fact, Couch DB took this approach as an optimization which apparently paid nice dividends.
You do need to be very aware of how your string is encoded in your binaries. When you do <<"StringLiteral">> in your code, you have to be aware that this is simply a binary serialization of the list of code-points. Your Erlang compiler reads your code as ISO-8859-1 characters, so as long as you only use Latin-1 characters and do this consistently, you should be fine, But this isn't very friendly to internationalization.
Most application software these day should prefer a unicode encoding. UTF-8 is compatible with your <<"StringLiteral">> for the first 128 codepoints, but not for the second 128, so be careful. You might be surprised what you see on your UTF-8 encoded web applications if you use <<"StrïngLïteral">> in your code.
There was an EEP proposal for binary support in the form of <<"StrïngLïteral"/utf8>>, but I don't think this is finalized.
Also be aware that your binary:split/2 function may have unexpected results in UTF-8 if there is a multi-byte character that contains the IS0-8859-1 byte that to are splitting on.
Some would argue that UTF-16 is a better encoding to use because it can be parsed more efficiently and can be more easily split by index, if you are assuming or verify that there are no 32-bit characters.
The unicode module should be use, but tread carefully when you use literals.
The only thing to be aware of is that a binary is a slice of bytes, whereas a list is a list of unicode codepoints. In other words, the latter is naturally unicode whereas the former requires you to do some sort of encoding, usually UTF-8.
To my knowledge, there is no drawbacks to your method.
Binaries are very efficient structures to store strings. If they are longer than 64B they are also stored outside process heap so they are not object of GC (still GC'ed by ref counting when last ref lost). Don't forget use iolists for concatenation them to avoid copying when performance matter.

Haskell source encoding

The Haskell 2010 Language Report says:
Haskell uses the Unicode [2] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.
Does this mean UTF-8?
In ghc-7.0.4/compiler/parser/Lexer.x.source:
$unispace = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
$whitechar = [\ \n\r\f\v $unispace]
$white_no_nl = $whitechar # \n
$tab = \t
$ascdigit = 0-9
$unidigit = \x03 -- Trick Alex into handling Unicode. See alexGetChar.
$decdigit = $ascdigit -- for now, should really be $digit (ToDo)
$digit = [$ascdigit $unidigit]
$special = [\(\)\,\;\[\]\`\{\}]
$ascsymbol = [\!\#\$\%\&\*\+\.\/\<\=\>\?\#\\\^\|\-\~]
$unisymbol = \x04 -- Trick Alex into handling Unicode. See alexGetChar.
$symbol = [$ascsymbol $unisymbol] # [$special \_\:\"\']
$unilarge = \x01 -- Trick Alex into handling Unicode. See alexGetChar.
$asclarge = [A-Z]
$large = [$asclarge $unilarge]
$unismall = \x02 -- Trick Alex into handling Unicode. See alexGetChar.
$ascsmall = [a-z]
$small = [$ascsmall $unismall \_]
$unigraphic = \x06 -- Trick Alex into handling Unicode. See alexGetChar.
$graphic = [$small $large $symbol $digit $special $unigraphic \:\"\']
...I'm not sure what to make of this. alexGetChar wasn't really helpful.
There was a proposal to standardize on UTF-8 as the standard encoding of Haskell source files, but I'm not sure if it was accepted or not.
In practice, GHC assumes all input files are UTF-8, but it ignores malformed byte sequences in comments.
Unicode is character set. UTF-8, UTF-16 etc are the concrete physical encodings of Unicode codepoints. Try to read here. The difference explained pretty well there.
Cited report's part just states that Haskell sources use Unicode character set. It doesn't state which encoding should be used at all. In other words, it says which characters could appear in the sources, but doesn't say how they could be written in term of plain bytes.
While the Haskell standard simply says Unicode the set of possible characters (as opposed to e.g. ASCII or Latin-1) it doesn't specify which of the several different encodings (UTF8 UTF16, UTF32, byte order) to use.
Alex, the lexer that comes with the Haskell Platform requires its input to be UTF8 encoded * which is why you see the code you mention. In practice I think all the major implementations of Haskell require source to be in UTF8.
* - This is actually a real problem as GHC stores strings and more importantly Data.Text internally as UTF16. It would be nice to be able to lex these directly rather then converting back and forth.
There is an important distinction between the data type (i.e. what “abstract” data you can work with) and its representation (i.e. how it is stored in the computer memory or on disk).
The Haskell Report says two things related to Unicode:
That the Char data type in Haskell represents a Unicode character (also known as code point). You should think of it as of an abstract data type that provides a certain interface (e.g. you can call isDigit or toLower on it), but you are not allowed to know how exactly it is represented internally. The specific implementation of Haskell (e.g. GHC) is free to represent it in memory in whatever way it wants and it doesn’t matter at all, as you can’t access the underlying raw bits anyway.
That a Haskell program is text, consisting of (abstract) Unicode code points, that is, essentially, a String. And then it goes on to explain how to parse this String. Once again, it is important to stress that it defines the syntax of Haskell in terms of sequences of abstract Unicode code points.
Now, to your question about Haskell source code. The Haskell Report does not specify how this Unicode text is encoded into zeroes and ones when stored in a file.
In fact, the Haskell Report does not specify how Haskell programs are stored at all! It doesn’t mention that Haskell source code is stored in files, that files have to be named after modules, and that the directory structure should follow the structure of module names – these all are considered to be compiler implementation details, and the idea is that this allows each compiler to store Haskell programs wherever and however they want: in files, in database tables, as jpeg photos of a blackboard with a program written on it with chalk. For this reason it does not specify the encoding either (it would make no sense to specify the encoding for a program written out on a blackboard 😕).
However, GHC, the de-facto standard Haskell compiler, assumes that Haskell programs are stored in files encoded as UTF-8, organised hierarchically, and named after module names.

Native newline characters?

What's the best way to determine the native newline characters such as '\n' or '\r\n' in Haskell?
I see there is a "nativeNewline" function in GHC.IO:Handle, but assume that it is both a private API and most of all non-standard Haskell.
You should think of the newline representation as part of the encoding of a text file that is stored in the filesystem, just like UTF-8. A text file is normally decoded when you read it into your program, and encoded when written -- converting to and from the native newline representation is done as part of this encoding and decoding. Inside your Haskell program, just as characters are represented by their Unicode code points, the newline character is always \n.
To tell the I/O system about the newline encoding you want to use, see the section on Newline Conversion in the documentation for System.IO.
System.IO.nativeNewline is not private - you can access it
to find out what GHC considers the native "newline" to be
on the current platform.
Note that the type of this variable, System.IO.Newline, does
not have a Show instance as of GHC 6.12.3. So you can't
easily print its value. Instead, check to see if it is equal
to System.IO.LF or System.IO.CRLF.
However, as Simon pointed out, you shouldn't need
to know about the native newline sequence with normal
usage of the text-oriented IO functions
in GHC.
This variable, together with the rest of the new Unicode-aware
capabilities of the IO system, is not yet part of the Haskell standard.
It was not included in the
Haskell 2010 report.
However, since it is already implemented in GHC,
and there is quite a widespread consensus that it is
important and useful, expect it to be included in one of the
upcoming yearly revisions of the standard.

Resources