Where is it specified whether Unicode identifiers should be allowed in a Haskell implementation?

I wanted to write some educational code in Haskell with Unicode (non-Latin) characters in the identifiers, so that the identifiers look nice and natural for speakers of a natural language other than English whose writing system does not use Latin characters. So I set out to find an appropriate Haskell implementation that would allow this.
But where is this feature specified in the language specification? How would I refer to this feature when looking for a conforming implementation? (And which Haskell implementations are known to actually support Unicode identifiers?)
It turned out that one Haskell implementation did accept my code with Unicode identifiers, whereas another one failed to accept it. I would like a way to formalize this requirement of my code, perhaps in the form of a language feature switch, so that if I or someone else tries to run my code, it would be immediately clear whether the implementation is missing the required feature and another one should be used instead. (There could also be a wiki page for this feature, "Unicode identifiers", listing which of the existing implementations support it, so that one would know where to go if one needs it.)
(BTW, I have put a "syntax" tag on this question, but I actually perceive it to be an issue of the level of lexing, a lower level than the syntax of a language. Is there a tag here for features of the lexing level of a language, rather than for features of the syntax specification of a language?)

The Online Report documents this under Lexemes. It also notes early on that "Haskell uses the Unicode character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.".
Actual compilers may or may not support Unicode identifiers. GHC does, but you need to keep in mind that Unicode codepoints must obey the same rules as ASCII characters: types must start with a codepoint which is classed as uppercase or titlecase, variables as lowercase (although de facto this is relaxed to alphabetic and not uppercase/titlecase; this might be worth asking for a clarification from the language committee), operators must be punctuation or symbol. (This means that you can't declare types in Arabic, for example, unless you prefix them with a character in some other script that is uppercase/titlecase.)
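The case rule described above can be explored with a small sketch. It only inspects the Unicode general category of the first character and mirrors the rule as stated in this answer; it is not GHC's actual lexer, and the function name `leading_char_class` is made up for illustration.

```python
import unicodedata

def leading_char_class(name: str) -> str:
    """Classify an identifier by the general category of its first character."""
    category = unicodedata.category(name[0])
    if category in ("Lu", "Lt"):
        return "upper"            # uppercase or titlecase: could start a type name
    if category == "Ll":
        return "lower"            # lowercase: could start a variable name
    if category.startswith("L"):
        return "caseless letter"  # e.g. Arabic or Chinese letters (Lo, Lm)
    return "not a letter"

for sample in ["Tree", "Дерево", "дерево", "شجرة", "树"]:
    print(sample, "->", leading_char_class(sample))
```

Scripts without case, such as Arabic, land in the "caseless letter" bucket, which is why a type name in such a script needs an uppercase or titlecase prefix from another script.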
As to collecting Unicode support information: while I don't know of a single page that provides it, searching for "unicode" on the Haskell Wiki finds information about Unicode support in a number of Haskell compilers.

Related

When to use Unicode Normalization Forms NFC and NFD?

The Unicode Normalization FAQ includes the following paragraph:
Programs should always compare canonical-equivalent Unicode strings as equal ... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.
and continues...
The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.
My questions are:
What makes NFC best for "general text"? What defines "internal processing" and why is it best left to NFD? And finally, never minding what is "best," are the two forms interchangeable as long as two strings are compared using the same normalization form?
The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.
In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.
For example, U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.
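The loss of information described above can be observed with Python's standard `unicodedata` module. This is just a small demonstration of that one character pair, not a recommendation for or against normalizing.

```python
import unicodedata

ano_teleia = "\u0387"  # GREEK ANO TELEIA
middle_dot = "\u00b7"  # MIDDLE DOT

normalized = unicodedata.normalize("NFC", ano_teleia)

print(ano_teleia == middle_dot)  # False: the raw code points differ
print(normalized == middle_dot)  # True: after NFC the distinction is gone
```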
There are risks that you take by not normalizing. For example, the letter “ä” can appear as a single Unicode character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or as two Unicode characters U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter and your code tests for data containing “ä”, using the precomposed form only, then it will not detect the latter. But in many cases, you don’t do such things but simply store the data, concatenate strings, print them, etc. Then there is a risk that the two representations result in somewhat different renderings.
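As a minimal sketch of the search problem just described, the snippet below builds a word with the decomposed "ä" and looks for the precomposed form. The helper `contains_nfc` is a hypothetical name; it simply normalizes both sides to NFC before comparing.

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

def contains_nfc(haystack: str, needle: str) -> bool:
    """Substring test after normalizing both strings to NFC."""
    return (unicodedata.normalize("NFC", needle)
            in unicodedata.normalize("NFC", haystack))

text = "M" + decomposed + "rchen"

print(precomposed in text)              # False: naive search misses it
print(contains_nfc(text, precomposed))  # True: both sides normalized first
```

Normalizing both strings to NFD instead would give the same answer for canonically equivalent input; what matters is that both sides use the same form.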
It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.
NFC is the general, common-sense form that you should use: ä is a single code point there, and that makes sense.
NFD is good for certain internal processing - if you want to make accent-insensitive searches or sorting, having your string in NFD makes it much easier and faster. Another usage is making more robust slug titles. These are just the most obvious ones, I am sure there are plenty of more uses.
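A rough sketch of the accent-insensitive comparison mentioned above: decompose to NFD, then drop the combining marks. The helper names are invented for this example, and real collation is more involved than this (a library such as ICU would handle language-specific rules).

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFD and drop combining marks such as diaeresis or acute."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accent_insensitive_equal(left: str, right: str) -> bool:
    """Compare two strings ignoring accents and case."""
    return strip_accents(left).casefold() == strip_accents(right).casefold()

print(strip_accents("Ärger über Öl"))
print(accent_insensitive_equal("Résumé", "resume"))
```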
If two strings x and y are canonical equivalents, then
toNFC(x) = toNFC(y)
toNFD(x) = toNFD(y)
Is that what you meant?

The difference between Chomsky type 3 and Chomsky type 2 grammar

I'm having trouble articulating the difference between Chomsky type 2 (context free languages) and Chomsky type 3 (Regular languages).
Can someone out there give me an answer in plain English? I'm having trouble understanding the whole hierarchy thing.
A Type II grammar is a Type III grammar with a stack
A Type II grammar is basically a Type III grammar with nesting.
Type III grammar (Regular):
Use Case - CSV (Comma Separated Values)
Characteristics:
can be read using an FSM (Finite State Machine)
requires no intermediate storage
can be read with Regular Expressions
usually expressed using a 1D or 2D data structure
is flat, meaning no nesting or recursive properties
Ex:
this,is,,"an "" example",\r\n
"of, a",type,"III\n",grammar\r\n
As long as you can figure out all of the rules and edge cases for the above text you can parse CSV.
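To make the finite-state idea concrete, here is a minimal sketch of a reader for a single CSV record. The four states and the function name `parse_csv_line` are choices made for this illustration, and it assumes the record terminator has already been stripped; a production parser (for example Python's own `csv` module) covers more edge cases.

```python
def parse_csv_line(line: str) -> list:
    """Split one CSV record into fields using an explicit state machine."""
    FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED = range(4)
    state = FIELD_START
    fields, current = [], []

    for ch in line:
        if state == FIELD_START:
            if ch == '"':
                state = QUOTED
            elif ch == ',':
                fields.append("")          # empty field
            else:
                current.append(ch)
                state = UNQUOTED
        elif state == UNQUOTED:
            if ch == ',':
                fields.append("".join(current))
                current = []
                state = FIELD_START
            else:
                current.append(ch)
        elif state == QUOTED:
            if ch == '"':
                state = QUOTE_IN_QUOTED    # closing quote or first half of ""
            else:
                current.append(ch)
        else:  # QUOTE_IN_QUOTED
            if ch == '"':
                current.append('"')        # "" inside quotes is a literal quote
                state = QUOTED
            elif ch == ',':
                fields.append("".join(current))
                current = []
                state = FIELD_START
            else:
                raise ValueError("unexpected character after closing quote")

    if state == QUOTED:
        raise ValueError("unterminated quoted field")
    fields.append("".join(current))        # last field of the record
    return fields

print(parse_csv_line('this,is,,"an "" example",'))
```

The only thing carried from one character to the next is the current state; the `fields` list is output, not memory the machine consults to decide what to do.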
Type II grammar (Context Free):
Use Case - HTML (Hyper Text Markup Language) or SGML in general
Characteristics:
can be read using a DPDA (Deterministic Pushdown Automaton)
will require a stack for intermediate storage
may be expressed as an AST (Abstract Syntax Tree)
may contain nesting and/or recursive properties
HTML could be expressed as a regular grammar:
<h1>Useless Example</h1>
<p>Some stuff written here</p>
<p>Isn't this fun</p>
But try parsing this using an FSM:
<body>
<div id=titlebar>
<h1>XHTML 1.0</h1>
<h2>W3C's failed attempt to enforce HTML as a context-free language</h2>
</div>
<p>Back when the web was still pretty boring, the W3C attempted to standardize away the quirkiness of HTML by introducing a strict specification</p>
<p>Unfortunately, everybody ignored it.</p>
</body>
See the difference? Imagine you were writing a parser, you could start on an open tag and finish on a closing tag but what happens when you encounter a second opening tag before reaching the closing tag?
It's simple: you push the first opening tag onto a stack and start parsing the second tag. Repeat this process for as many levels of nesting as exist, and if the syntax is well-structured, the stack can be unrolled one layer at a time in the opposite order from which it was built.
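The push-and-unroll procedure described above can be sketched as follows. This is a toy checker under stated assumptions: every element has an explicit closing tag (no void elements such as `<br>`, no comments), and the names `TAG` and `is_well_nested` are invented for the example.

```python
import re

# Tokenizing single tags is a regular task, so a regex is enough for that part.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def is_well_nested(markup: str) -> bool:
    """Check that every opening tag is closed in the reverse order of opening."""
    stack = []
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)              # push on an opening tag
        elif not stack or stack.pop() != name:
            return False                    # closing tag does not match the top
    return not stack                        # everything opened was closed

print(is_well_nested("<body><div id=titlebar><h1>XHTML 1.0</h1></div><p>text</p></body>"))
print(is_well_nested("<div><p>crossed</div></p>"))
```

Recognizing one tag needs no memory, but matching tags across arbitrary nesting depth is what requires the stack.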
Due to the strict nature of 'pure' context-free languages, they're relatively rare unless they're generated by a program. JSON is a prime example.
The benefit of context-free languages is that, while very expressive, they're still relatively simple to parse.
But wait, didn't I just say HTML is context-free? Yep, if it is well-formed (i.e. XHTML).
While XHTML may be considered context-free, the looser-defined HTML would actually be considered Type I (i.e. context-sensitive). The reason is that when the parser reaches poorly structured code, it makes decisions about how to interpret the code based on the surrounding context. For example, if an element is missing its closing tag, the parser needs to determine where that element sits in the hierarchy before it can decide where the closing tag should be placed.
Other features that could make a context-free language context-sensitive include templates, imports, preprocessors, macros, etc.
In short, context-sensitive languages look a lot like context-free languages, but the elements of a context-sensitive language may be interpreted in different ways depending on the program state.
Disclaimer: I am not formally trained in CompSci, so this answer may contain errors or assumptions. If you ask me the difference between a terminal and a non-terminal, you'll earn yourself a blank stare. I learned this much by actually building a Type III (Regular) parser and by reading extensively about the rest.
The wikipedia page has a good picture and bullet points.
Roughly, the underlying machine that can describe a regular language does not need memory. It runs as a state machine (DFA/NFA) on the input. Regular languages can also be expressed with regular expressions.
A language with the "next" level of complexity added to it is a context free language. The underlying machine describing this kind of language will need some memory to be able to represent the languages that are context free and not regular. Note that adding memory to your machine makes it a little more powerful, so it can still express languages (e.g. regular languages) that didn't need the memory to begin with. The underlying machine is typically a push-down automaton.
Type 3 grammars consist of a series of states. They cannot express embedding. For example, a Type 3 grammar cannot require matching parentheses because it has no way to show that the parentheses should be "wrapped around" their contents. This is because, as Derek points out, a Type 3 grammar does not "remember" anything about the previous states that it passed through to get to the current state.
Type 2 grammars consist of a set of "productions" (you can think of them as patterns) that can have other productions embedded within them. Thus, they are recursively defined. A production can only be defined in terms of what it contains, and cannot "see" outside of itself; this is what makes the grammar context-free.
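As an illustration of a production that embeds itself, consider the grammar S → "(" S ")" S | ε for balanced parentheses. The sketch below is a hand-written recognizer for that one grammar (the name `balanced` is made up here); the recursive call handles the embedded S, and the loop handles the trailing S.

```python
def balanced(text: str) -> bool:
    """Recognize the language of S -> "(" S ")" S | empty."""

    def parse_s(pos: int) -> int:
        """Consume one S starting at pos and return the position just past it."""
        while pos < len(text) and text[pos] == "(":
            inner_end = parse_s(pos + 1)            # the embedded S
            if inner_end >= len(text) or text[inner_end] != ")":
                raise ValueError("missing closing parenthesis")
            pos = inner_end + 1                     # continue with the trailing S
        return pos

    try:
        return parse_s(0) == len(text)              # S must cover the whole input
    except ValueError:
        return False

for sample in ["", "()", "(()())", "(()", ")("]:
    print(repr(sample), balanced(sample))
```

The call stack plays the role of the pushdown automaton's stack: each pending recursive call remembers one unmatched opening parenthesis, which is exactly the memory a finite state machine lacks.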

Use of unicode characters in Haddock documentation

Haddock seems to incorrectly re-encode non-ASCII characters in the documentation in UTF-8 encoded source files. I often need to include mathematical formulas in the documentation and they are much more readable if some common math symbols such as summation (∑) can be used.
However, after running the files through haddock, these symbols become blank squares.
Haddock has the option --use-unicode, but that just converts function arrows in function signatures etc. into Unicode characters, while still breaking the actual documentation.
Even better would be if this can be controlled from cabal haddock!
I'm using Haddock version 2.9.4.
Note that Haddock uses the GHC API to do parsing. Non-ASCII characters in comments are not handled properly by GHC < 7.4, but it seems that with GHC 7.4 it works fine.
If UTF-8 cannot be used and numeric character references like &#8721; or &#x2211; (these are correct references for the n-ary summation symbol ∑) are regarded as unreadable, then I’m afraid the only option is to use named references like &sum;, if they get passed through to the HTML result and are supported by the browser(s) that will be used.
That’s a big “if,” since the new HTML5 entities have rather limited support, but perhaps in an intranet where everyone uses Firefox... HTML5 entities:
http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html
(And most of the references are not as mnemonic as &sum;.)
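For reference, the decimal and hexadecimal character references for a symbol can be derived from its code point, and Python's standard `html` module can confirm that a reference decodes to the intended character. This only illustrates the references themselves; whether Haddock passes them through unchanged is not verified here.

```python
import html

summation = "\u2211"  # N-ARY SUMMATION

decimal_ref = f"&#{ord(summation)};"
hex_ref = f"&#x{ord(summation):X};"

print(decimal_ref, hex_ref)                      # &#8721; &#x2211;
print(html.unescape(decimal_ref) == summation)   # True
print(html.unescape(hex_ref) == summation)       # True
print(html.unescape("&sum;") == summation)       # True
```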

Does case sensitivity have anything to do with strongly typed languages (or loosely typed languages)?

(I admit this may be a n00b question - I know very little about CS theory, mostly a hands-on/hobby sort.)
I was googling "strongly-typed language" for the official definition, and one of the top links I found was from Yahoo Answers, which suggested that case sensitivity was part of whether a language is loosely/strongly typed.
I had always thought the simple answer to the difference between a strongly typed/weakly typed language is that the first requires explicit type declarations, while the latter is more open, even "dynamic".
The two S/O threads (here and here) I found so far seem to suggest that (more or less), but they don't mention anything about case sensitivity. Is there a relation at all between case sensitivity and strong/weak typing?
A couple of clarifications:
Case sensitivity has nothing to do with strong vs. weak typing, static vs. dynamic typing or any other property of the type system. I don't know why the answer on yahoo answers has gotten its one upvote, but it's completely wrong. Just ignore it.
Strong typing isn't a well-defined term, but it is often used to refer to languages with few implicit type conversions, i.e. languages where it is an error to perform operations on types that do not support that operation.
As an example, multiplying the strings "foo" and "bar" gives 0 as the result in Perl, while it causes a type error in Ruby, Python, Java, Haskell, ML and many other languages. Thus those languages are more strongly typed than Perl.
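The Python side of that comparison is easy to check. The snippet below is a small demonstration, with the variable `result` introduced only so the outcome can be inspected.

```python
try:
    result = "foo" * "bar"       # no implicit conversion of strings to numbers
except TypeError as error:
    result = None
    print("TypeError:", error)

# Multiplying a string by an int is a defined operation (repetition),
# so strictness here is about which conversions happen implicitly.
print("ab" * 3)
```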
Strong typing is also sometimes used as a synonym for static typing.
A statically typed language is a language in which the types of variables, functions and expressions are known at compile time (or before runtime anyway - a statically typed language need not be compiled per se, though in practice it usually is). This means that if a statically typed program contains a type error, it will not run. If a dynamically typed program contains a type error it will run up to the point where the error happens and then crash.
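The "runs up to the point where the error happens" behaviour can be seen in a dynamically typed language such as Python. In this sketch, `run` and `log` are illustrative names; the faulty line is only detected when execution reaches it.

```python
log = []

def run():
    log.append("step 1")
    log.append("step 2")
    log.append("step " + 3)   # type error, only detected when this line executes
    log.append("step 4")      # never reached

try:
    run()
except TypeError:
    log.append("crashed")

print(log)  # ['step 1', 'step 2', 'crashed']
```

A static type checker would reject the equivalent program before any step ran.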
Whether a language requires type annotations is (somewhat) independent of whether its type system is strong or weak or static or dynamic. In theory a dynamically typed language could require (or at least allow) type annotations and then throw runtime errors when those annotations are broken (though I don't know of any dynamically typed language that actually does this).
More importantly there are many statically and strongly typed languages (e.g. Haskell, ML) that don't require type annotations, but instead use type inference algorithms to infer the types.
In theory, case sensitivity is completely unrelated to type strictness. Case sensitivity is about whether the identifiers foo, FOO, and fOo refer to the same variable, function, or what-have-you. Type strictness is about whether variables have types or just values do, how easy it is to convert among types, and so on.
In practice, there might be a correlation between case sensitivity and type strictness, but I can't think of enough case-insensitive languages right now to make an assessment. My impression is that most languages commonly used today are case sensitive — possibly because C was case sensitive and very influential, possibly because it was the only way to force people to stop PROGRAMMING IN ALL CAPS after a couple decades of FORTRAN, COBOL, and BASIC.
No - they're not connected. Strongly typed languages force you to specify the type of data that a variable may hold - such as a real number, an integer, a textual string, or some programmer-defined object. You can't accidentally assign another type of data to that variable unless it is implicitly convertible: for example, you can generally put an integer into a real number (i.e. double x = 3.14; x = 3; is ok but int x = 3; x = 3.14; might not be, depending on how strongly typed the language is). Weakly typed languages just store whatever they're asked to without doing these sanity checks. In strongly typed languages like C++, you can still create a type that can store data of any of a specific set of types (e.g. C++'s boost::variant), but sometimes they're a bit more limited in how much you can do and how convenient it is to use.
Case sensitivity means that the uppercase and lowercase versions of the same letter are treated as different; case insensitivity means they are considered equivalent for some purposes, normally in a string comparison or regular expression match. It is unusual but not unheard of for modern computer languages to ignore the case of letters in variable names (identifiers).

Are there programming languages that rely on non-latin alphabets?

Every programming language I have ever seen has been based on the Latin alphabet, this is not surprising considering I live in Canada...
But it only really makes sense that there would be programming languages based on other alphabets, or else bright computer scientists across the world would have to learn a new alphabet to go on in the field. I know for a fact that people in countries dominated by other alphabets develop languages based off the Latin alphabet (eg. Ruby from Japan), but just how common is it for programming languages to be based off of other alphabets like Arabic, or Cyrillic, or even writing systems which are not alphabetic but rather logographic in nature such as Japanese Kanji?
Also are any of these languages in active widespread use, or are they mainly used as teaching tools?
This is something that has bugged me since I started programming, and I have never run across someone who could think of a real answer.
Have you seen Perl?
APL is probably the most widely known. It even has a cool keyboard overlay (or was it a special keyboard you had to buy?).
In the non-alphabetic category, we also have programming languages like LabVIEW, which is mostly graphical. (You can label objects, and you can still do string manipulation, so there's some textual content.) LabVIEW has been used in data acquisition and automation for years, but gained a bit of popularity when it became the default platform for Lego Mindstorms.
There's a list on Wikipedia. I don't think any of them is really prevalent though. Many programmers can learn to write programs with English keywords even if they don't understand the language. Ruby is a good example: you'll still see Japanese identifiers and comments in some Ruby code.
Well, Brainf* uses no Latin characters, if you'll pardon the language...and the pun.
Many languages allow Unicode identifiers. It's part of standard Java, and both g++ (though you have to use \uNNNN escapes) and MSVC++ allow them (see also this question). And some allow using #define (or maybe better) to rename control structures.
But in reality, people don't do this for the most part. See past questions such as Language of variable names?, Should all code be written in English?, etc.
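As one concrete case of the identifier support mentioned above, Python 3 accepts non-ASCII letters in identifiers (PEP 3131) while its keywords stay English. The Cyrillic names below are invented for this sketch.

```python
# Function, parameter and local variable names written in Cyrillic.
def площадь_круга(радиус):
    число_пи = 3.141592653589793
    return число_пи * радиус ** 2

print(площадь_круга(2.0))

# str.isidentifier() reports whether a string is a valid identifier.
print("радиус".isidentifier())  # True: letters from any script are allowed
print("∑".isidentifier())       # False: a math symbol is not a letter
```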
Agda.
Sample Snippet:
mutual
  data ωChain : Set where
    _∷_,_ : ∀ (x : carrier) (xω : ∞ ωChain) (p : x ≼ xω) → ωChain
  head : ωChain → carrier
  head (x ∷ _ , _) = x
  _≼_ : carrier → ∞ ωChain → Set
  x ≼ xω = x ≤ head (♭ xω)
Well, there's always APL. That has its own Unicode characters, and I believe it used to require a special keyboard too.
There is one language used in a Russian ERP system, named after the company that developed it: 1C. But its identifiers and operators have English equivalents.
Also, I know that Haskell supports Unicode identifiers, so you can write programs in any alphabet. But this is not useful (my native language is Russian). It's quite enough that you have to type program messages and helpful comments in your native alphabet.
Other people are answering with languages that use punctuation marks in addition to Latin letters. I wonder why no one mentioned digits 0 to 9 as well.
In some languages, and in some implementations of some languages, programmers can use a wide range of characters in identifiers, such as Arabic or Chinese characters. This doesn't mean that the language relies on them though.
In most languages, programmers can use a wide range of characters in string literals (in quotation marks) and in comments. Again this doesn't mean that the language relies on them.
In every programming language that I've seen, the language does rely on punctuation marks and digits. So this answers your question but not in the way you expect.
Now let's try to find something meaningful. Is there a programming language where keywords are chosen from non-Latin alphabets? I would guess not, except maybe for joke languages. What would be the point of inventing a programming language that makes it impossible for some programmers to even input a program?
EDIT: My guess is wrong. Besides APL's usage of various invented punctuation marks, it does depend on a few Greek keywords, where each keyword is one letter long, such as the letter rho.
I just found an interesting wiki for "esoteric programming languages".

Resources