I wonder: if this
alph = ['a'..'z']
returns me
"abcdefghijklmnopqrstuvwxyz"
How can I return the French alphabet then? Can I somehow pass a locale?
Update:
Well, I know that English and French have the same letters. But my point is: what if they were not the same, yet still started with A and ended with Z? It would be nice to have human-language range support.
At least some languages come with localization support.
(just trying Haskell, reading a book)
Haskell Char values are not real characters; they are Unicode code points. In some other languages the native character type may represent other things, like ASCII characters or "code page whatsitsnumber" characters, or even something selectable at runtime, but not in Haskell.
The range 'a'..'z' coincides with the English alphabet for historical reasons, both in Unicode and in ASCII, and also in character sets derived from ASCII such as ISO8859-X. There is no commonly supported coded character set where some contiguous range of codes coincides with the French alphabet. That is, if you count letters with diacritics as separate letters. The accepted practice seems to exclude letters with diacritics, so the French alphabet coincides with English, but this is not so for other Latin-derived alphabets.
In order to get most alphabets other than English, one needs to enumerate the characters explicitly by hand rather than with a range expression. For some languages one cannot even use Char to represent all letters, as some of them need more than one code point, such as Hungarian "ly", Spanish "ll" (before 2010), or Dutch "ij" (according to some authorities; there is no single commonly accepted definition).
No language that I know of supports arbitrary human alphabets as range expressions out of the box.
While programming languages usually support sorting by the current locale (just search for collate on Hackage), there is no library I know of that provides a list of alphabetic characters by locale.
Modern (Unicode-based) systems that allow for localized characters try to support many non-Latin alphabets as well, and thus a very large number of alphabetic characters.
Enumerating the alphabetic characters just in Unicode's Basic Multilingual Plane already gives over 40k characters:
    GHCi> length $ filter Data.Char.isAlpha $ map Data.Char.chr [0..256*256]
    48408
While I am aware of libraries that allow constructing alphabetic indices, I don't know of any Haskell binding for this feature.
I'm working with some text from Twitter, using Tweepy. All that is fine, and at the moment I'm just looking to start with some basic frequency counts for words. However, I'm running into an issue where users' ability to use different fonts for their tweets makes it look like some words are unique, when in reality they are words that have already been encountered, just in a different font/font size, as in the picture below (those words were counted previously and appear earlier in the spreadsheet).
This messes up the accuracy of the counts. I'm wondering if there's a package or general solution to make all the words a uniform font/size - either while I'm tokenizing it (just by hand, not using a module) or while writing it to the csv (using the csv module). Or any other solutions for this that I may not be considering. Thanks!
You can (mostly) solve your problem by normalising your input, using unicodedata.normalize('NFKC', str).
The KC normalization form (NF stands for "normalization form") first does a "compatibility decomposition" on the text, which replaces Unicode characters that are merely style variants with their plain equivalents, and then does a canonical composition on the result, so that ñ, which is converted to an n and a separate combining tilde by the decomposition, is then turned back into an ñ, the canonical composite for that character. (If you don't want the recomposition step, use NFKD normalisation.) See Unicode Annex 15 for a more precise description, with examples.
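For the tweet-counting case, here is a minimal sketch in Python (the styled word is constructed in code rather than pasted, and the helper name to_plain is just for illustration):

    import unicodedata

    def to_plain(text):
        # NFKC = compatibility decomposition followed by canonical composition.
        # It folds styled "mathematical" letters back onto the plain letters they
        # are variants of, so differently "fonted" words count as the same word.
        return unicodedata.normalize('NFKC', text)

    # Build a styled word from the Mathematical Alphanumeric Symbols block
    # (U+1D400 is MATHEMATICAL BOLD CAPITAL A), as a stand-in for tweet text.
    styled = ''.join(chr(0x1D400 + ord(c) - ord('A')) for c in 'HELLO')
    print(styled)                       # the bold-letter form of HELLO
    print(to_plain(styled))             # HELLO
    print(to_plain(styled) == 'HELLO')  # True

Running your tokens through such a function before counting (and before writing them to the CSV) makes the styled and unstyled spellings collapse into one entry.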
Unicode contains a number of symbols, mostly used for mathematics, which are simply stylistic variations on some letter or digit. Or, in some cases, on several letters or digits, such as ¼ or ℆. In particular, this includes commonly-used symbols written with font variants which have particular mathematical or other meanings, such as ℒ (the Laplace transform) and ℚ (the set of rational numbers). Compatibility decomposition will strip out the stylistic information, which reduces those four examples to '1/4', 'c/u', 'L' and 'Q', respectively.
The first published Unicode standard defined a Letterlike Symbols block in the Basic Multilingual Plane (BMP). (All of the above examples are drawn from that block.) In Unicode 3.1, complete Latin and Greek alphabets and digits were added in the Mathematical Alphanumeric Symbols block, which includes 13 different font variants of the 52 upper- and lower-case letters of the Roman alphabet, 58 Greek letters in five font variants (some of which could pass for Roman letters, such as 𝝪, which is upsilon, not capital Y), and the 10 digits in five variants (𝟎 𝟙 𝟤 𝟯 𝟺). And a few loose characters which mathematicians apparently asked for.
None of these should be used outside of mathematical typography, but that's not a constraint which most users of social networks care about. So people compensate for the lack of styled text on Twitter (and elsewhere) by using these Unicode characters, despite the fact that they are not properly rendered on all devices, that they make life difficult for screen readers, that they cannot readily be searched, and all the other disadvantages of hacked typography, such as the issue you are running into. (Some of the rendering problems are also visible in your screenshot.)
Compatibility decomposition can go a long way in resolving the problem, but it also tends to erase information which is really useful. For example, x² and H₂O become just x2 and H2O, which might or might not be what you wanted. But it's probably the best you can do.
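To see that trade-off concretely (standard library only, using the examples just mentioned):

    import unicodedata

    for s in ['x\u00b2', 'H\u2082O']:   # x² and H₂O
        print(s, '->', unicodedata.normalize('NFKC', s))
    # prints:  x² -> x2
    #          H₂O -> H2O
    # The superscript/subscript distinction is "compatibility" information,
    # so NFKC discards it along with the unwanted font styling.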
I was just parsing the following website.
There one finds the text
und wären damit auch
At first, the "ä" looks perfectly fine, but once I inspect it, it turns out that this is not the regular "ä" (represented as ascw 228) but this:
ascw: 97, char: a
ascw: 776, char: ¨
I have never before seen an "ä" represented like this.
How can it happen that a website uses this weird character combination, and what might be the benefit of it?
What you don't mention in your question is the encoding used. Quite obviously it is a Unicode-based encoding.
In Unicode, code point U+0308 (776 in decimal) is the combining diaeresis. Out of the letter a and the diaeresis, the German character ä is created.
There are indeed two ways to represent German characters with umlauts (ä in this case). Either as a single code point:
U+00E4 latin small letter A with diaeresis
Or as a sequence of two code points:
U+0061 latin small letter A
U+0308 combining diaeresis
Similarly you would combine two code points for an upper case 'Ä':
U+0041 latin capital letter A
U+0308 combining diaeresis
In general, Unicode relies on combining sequences, since that requires far fewer code points to cover the wide range of characters with diacritics. However, for historical reasons, precomposed code points exist for letters such as the German umlauts and the French accented letters.
The Unicode libraries in most programming languages provide functions to normalize a string, i.e. to either convert combining sequences into single code points where possible, or to expand single code points into their decomposed sequences. Also see Unicode Normalization Forms.
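The question does not name a language, so here is a quick sketch using Python's standard unicodedata module; any Unicode library exposes the same normalization forms:

    import unicodedata

    composed   = '\u00e4'    # ä as a single code point
    decomposed = 'a\u0308'   # 'a' followed by COMBINING DIAERESIS (776 decimal)

    print(composed == decomposed)                                # False: different code points
    print(unicodedata.normalize('NFC', decomposed) == composed)  # True: composed form
    print(unicodedata.normalize('NFD', composed) == decomposed)  # True: decomposed form
    print(unicodedata.name('\u0308'))                            # COMBINING DIAERESIS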
Oh my, this was the answer to the original problem with the name of a file upload.
Cannot convert argument 2 to ByteString because the character at index 6 has value 776 which is greater than 255
For future reference.
In the Swift documentation for comparing strings, I found the following:
Two String values (or two Character values) are considered equal if their extended grapheme clusters are canonically equivalent. Extended grapheme clusters are canonically equivalent if they have the same linguistic meaning and appearance, even if they are composed from different Unicode scalars behind the scenes.
Then the documentation proceeds with the following example, which shows two strings that are "canonically equivalent":
For example, LATIN SMALL LETTER E WITH ACUTE (U+00E9) is canonically equivalent to LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301). Both of these extended grapheme clusters are valid ways to represent the character é, and so they are considered to be canonically equivalent:
Ok. Somehow e and é look the same and also have the same linguistic meaning. Sure, I'll give them that. I took a Spanish class at some point and the prof wasn't too strict on which form of e we used, so I'm guessing this is what they are referring to. Fair enough.
The documentation goes further to show two strings that are not canonically equivalent:
Conversely, LATIN CAPITAL LETTER A (U+0041, or "A"), as used in English, is not equivalent to CYRILLIC CAPITAL LETTER A (U+0410, or "А"), as used in Russian. The characters are visually similar, but do not have the same linguistic meaning:
Now here is where the alarm bells go off and I decide to ask this question. It seems that appearance has nothing to do with it because the two strings look exactly the same, and they also admit this in the documentation. So it seems that what the string class is really looking for is linguistic meaning?
This is why I ask what it means for the strings to have the same or different linguistic meaning: e is the only form of e I know of that is mainly used in English, while I have only seen é used in languages like French or Spanish. So why is it that А being used in Russian and A being used in English causes the string class to say that they are not equivalent?
I hope I was able to walk you through my thought process. Now my question is: what does it mean for two strings to have the same linguistic meaning (in code, if possible)?
You said:
Somehow e and é look the same and also have the same linguistic meaning.
No. You have misread the document. Here's the document again:
LATIN SMALL LETTER E WITH ACUTE (U+00E9) is canonically equivalent to LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
Here's U+00E9: é
Here's U+0065: e
Here's U+0301: ´
Here's U+0065 followed by U+0301: é
So U+00E9 (é) looks and means the same as U+0065 U+0301 (é). Therefore they must be treated as equal.
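Swift performs this comparison for you. In a language whose == compares raw code points (Python is used here only to make the scalars visible), you can reproduce the same equivalence by normalizing first:

    import unicodedata

    precomposed = '\u00e9'   # LATIN SMALL LETTER E WITH ACUTE
    combined    = 'e\u0301'  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

    print(precomposed == combined)                                # False: code-point comparison
    print(unicodedata.normalize('NFC', combined) == precomposed)  # True: canonically equivalent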
So why is Cyrillic А different from Latin A? UTN #26 gives several reasons. Here are some:
“Traditional graphology has always treated them as distinct scripts, …”
“Literate users of Latin, Greek, and Cyrillic alphabets do not have cultural conventions of treating each other's alphabets and letters as part of their own writing systems.”
“Even more significantly, from the point of view of the problem of character encoding for digital textual representation in information technology, the preexisting identification of Latin, Greek, and Cyrillic as distinct scripts was carried over into character encoding, from the very earliest instances of such encodings.”
“[A] unified encoding of Latin, Greek, and Cyrillic would make casing operations an unholy mess, …”
Read the tech note for full details.
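And the Cyrillic case in the same style: no normalization form maps one script's letter onto the other's, so the two strings remain distinct no matter how you normalize them:

    import unicodedata

    latin_a    = '\u0041'   # LATIN CAPITAL LETTER A
    cyrillic_a = '\u0410'   # CYRILLIC CAPITAL LETTER A

    for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
        print(form, unicodedata.normalize(form, cyrillic_a) == latin_a)   # False every time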
Why are special characters (except underscore) not allowed in variable names in programming languages?
Is there any reason related to computer architecture or organisation?
Most languages have long histories, using ASCII (or EBCDIC) character sets. Those languages tend to have simple identifier descriptions (e.g., starts with A-Z, followed by A-Z,0-9, maybe underscore; COBOL allows "-" as part of a name). When all you had was an 029 keypunch or a teletype, you didn't have many other characters, and most of them got used as operator syntax or punctuation.
On older machines, this did have the advantage that you could encode an identifier as a radix-37 (A-Z, 0-9, null) number [6 characters in 32 bits] or a radix-64 (A-Z, a-z, 0-9, underscore and null) number [6 characters in 36 bits, a common word size in earlier generations of machines] for small symbol tables. A consequence: many older languages had 6-character limits on identifier sizes (e.g., FORTRAN).
LISP languages have long been much more permissive; names can be anything but characters with special meaning to LISP, e.g., ( ) [ ] ' ` #, and usually there are ways to insert these characters into names using some kind of escape convention. Our PARLANSE language is like LISP; it uses "~" as an escape, so you can write ~(begin+~)end as a single identifier whose actual spelling is "(begin+end)".
More modern languages (Java, C#, Scala, ...., uh, even PARLANSE) grew up in an era of Unicode, and tend to allow most of Unicode in identifiers (actually, they tend to allow named Unicode subsets as parts of identifiers). An identifier made of Chinese characters is perfectly legal in such languages.
It's kind of a matter of taste in the Western hemisphere: most identifier names still tend to use just letters and digits (sometimes, Western European letters). I don't know what the Japanese and Chinese really use for identifier names when they have Unicode-capable character sets; what little Asian code I have seen tends to follow Western identifier conventions, but the comments tend to use much more of the local native and/or Unicode character set.
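Python 3 is one example of such a Unicode-era language (its identifiers follow Unicode's identifier rules, per PEP 3131), so the claim is easy to check:

    # An identifier written with Chinese characters is accepted by the parser.
    变量 = 42
    print(变量)                        # 42

    # str.isidentifier() applies the same identifier rules at runtime.
    print("变量".isidentifier())       # True
    print("my-var".isidentifier())     # False: '-' is an operator character, not a name character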
Fundamentally, it is because those characters are mostly used as operators or separators, so allowing them in names would introduce ambiguity.
Is there any reason related to computer architecture or organisation?
No. The computer can't see the variable names. Only the compiler can. But it has to be able to distinguish a variable name from two variable names separated by an operator, and most language designers have adopted the principle that the meaning of a computer program should not be affected by white space.
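You can watch a real front end make exactly that distinction. Python's own tokenize module, for instance (the name rate-limit is made up for the example):

    import io
    import tokenize

    # If '-' were legal inside names, this could be one variable or a subtraction.
    # The tokenizer resolves it the only way the grammar allows: NAME, OP, NAME.
    for tok in tokenize.generate_tokens(io.StringIO('rate-limit\n').readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))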
In the Dragon Book's exercise 3.3.1, the student should
Consult the language reference manuals to determine (i) the set of characters that form the input alphabet (excluding those that may only appear in character strings or comments [...]) for each of the following languages: [...].
It makes no real sense to me to describe literally all the characters, like a, b, /, for a language, even if it is an exercise in a compilers course. Isn't the alphabet of a programming language the set of possible words, like {id, int, float, string, if, for, ... }?
And if you take "characters" in the basic sense of the word, is ??/ in C one character or three (or both)?
The alphabet of a language is the set of characters not the words.
Isn't the alphabet of a programming language the set of possible words, like {id, int, float, string, if, for, ... }?
No, the alphabet is the set of characters that are used to form words. When a language is specified, the alphabet must be given; otherwise you cannot distinguish a valid token from an invalid token.
Update
You are confusing the term "word" with "token". A word is not some part of a language or program. A word is a finite string of characters from the alphabet. It has nothing to do with a language construct like "int" or "while". For example, each C program is a word, because it is a finite string of characters from the alphabet. The set of all of these programs (words) forms the C programming language. Tokens like "void" or "int" are an entirely different thing.
To recap, you start by defining some set of characters you want to use. This is called the alphabet. Finite strings of these characters form words. A language is some subset of all possible words. To define a language, you define which words belong to it, for example with a regular expression or a context-free grammar.
Wikipedia has a good page on formal languages.
http://en.wikipedia.org/wiki/Formal_language
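A tiny concrete rendering of those definitions (the alphabet and the membership rule below are arbitrary, chosen only for illustration):

    alphabet = set('ab')            # the alphabet: a finite set of characters

    def is_word(s):
        # a word: any finite string over the alphabet
        return all(ch in alphabet for ch in s)

    def in_language(s):
        # a language: some subset of all words; here, words with equally many a's and b's
        return is_word(s) and s.count('a') == s.count('b')

    print(in_language('abab'))   # True
    print(in_language('aba'))    # False: a word, but not in this language
    print(in_language('abc'))    # False: 'c' is not even in the alphabet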
The confusion comes from theory defining alphabet as the set of symbols from which the strings in a language are formed. Note that the grammars for programming languages use tokens and not characters as terminal symbols.
Traditionally, from the perspective of language theory, programming languages involve two language definitions: 1) The one that has characters as the alphabet and tokens as the valid strings. 2) The one that has tokens as the alphabet and programs as the valid strings. That's why programming languages are usually specified in two parts, a lexical, and a syntactical analyzer.
It is not strictly necessary to have the two definitions to parse a programming language. A single grammar can be used to specify a programming language using characters as the input alphabet. It's just that the characters-to-tokens part has been easier to specify with regular expressions, and the tokens-to-program part with grammars.
Modern compiler-compilers like ANTLR use grammar-specification languages that incorporate the expressive convenience of regular expressions, so a character-to-program definition can be done with a single grammar. Still, separating the lexical from the syntactical remains the most convenient way to parse a programming language, even with such tools.
One last example: imagine that the grammar productions for an if-then-else-end construct had to deal at the character level with:
Whitespace.
Keywords within programming language strings: "Then, the end."
Variable names that contain keywords: 'tiff',
...
It can be done, but it would be extremely complicated.
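For contrast, here is the conventional two-level approach in miniature: a regular-expression lexer turns characters into tokens, and a grammar would then work on those tokens. The token names and patterns are invented purely for illustration:

    import re

    TOKEN_SPEC = [
        ('NUMBER', r'\d+'),
        ('ID',     r'[A-Za-z_][A-Za-z0-9_]*'),   # longest match: 'tiff' stays one name
        ('OP',     r'[+\-*/=]'),
        ('SKIP',   r'\s+'),                      # whitespace is handled once, here
    ]
    KEYWORDS = {'if', 'then', 'else', 'end'}
    MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

    def tokens(source):
        # Characters in, tokens out: the parser never sees raw characters again.
        for m in MASTER.finditer(source):
            kind, text = m.lastgroup, m.group()
            if kind == 'SKIP':
                continue
            if kind == 'ID' and text in KEYWORDS:
                kind = text.upper()              # keywords reclassified after matching
            yield kind, text

    print(list(tokens('if tiff = 1 then end')))
    # [('IF', 'if'), ('ID', 'tiff'), ('OP', '='), ('NUMBER', '1'), ('THEN', 'then'), ('END', 'end')]

Note that this sketch does not handle string literals, so keywords inside strings (the second point above) would still need a STRING pattern in the lexer; the grammar proper never has to think about it.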