What does it mean that two strings have the same linguistic meaning?

In the Swift documentation for comparing strings, I found the following:
Two String values (or two Character values) are considered equal if
their extended grapheme clusters are canonically equivalent. Extended
grapheme clusters are canonically equivalent if they have the same
linguistic meaning and appearance, even if they are composed from
different Unicode scalars behind the scenes.
Then the documentation proceeds with the following example, which shows two strings that are "canonically equivalent":
For example, LATIN SMALL LETTER E WITH ACUTE (U+00E9) is canonically
equivalent to LATIN SMALL LETTER E (U+0065) followed by COMBINING
ACUTE ACCENT (U+0301). Both of these extended grapheme clusters are
valid ways to represent the character é, and so they are considered to
be canonically equivalent:
OK. Somehow e and é look the same and also have the same linguistic meaning. Sure, I'll give them that. I took a Spanish class at some point and the professor wasn't too strict about which form of e we used, so I'm guessing this is what they are referring to. Fair enough.
The documentation goes further to show two strings that are not canonically equivalent:
Conversely, LATIN CAPITAL LETTER A (U+0041, or "A"), as used in
English, is not equivalent to CYRILLIC CAPITAL LETTER A (U+0410, or
"А"), as used in Russian. The characters are visually similar, but do
not have the same linguistic meaning:
Now here is where the alarm bells go off and I decided to ask this question. It seems that appearance has nothing to do with it, because the two strings look exactly the same, and the documentation admits as much. So it seems that what the String type is really looking for is linguistic meaning?
This is why I ask what it means for the strings to have the same or different linguistic meaning. The plain e is the only form of e I know of that is mainly used in English, whereas I have only seen é used in languages like French or Spanish. So why is the fact that А is used in Russian while A is used in English what causes the String type to say they are not equivalent?
I hope I was able to walk you through my thought process. Now my question is: what does it mean for two strings to have the same linguistic meaning (in code, if possible)?

You said:
Somehow e and é look the same and also have the same linguistic meaning.
No. You have misread the document. Here's the document again:
LATIN SMALL LETTER E WITH ACUTE (U+00E9) is canonically equivalent to LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
Here's U+00E9: é
Here's U+0065: e
Here's U+0301: ´
Here's U+0065 followed by U+0301: é
So U+00E9 (é) looks and means the same as U+0065 U+0301 (é). Therefore they must be treated as equal.
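Swift's `==` applies this canonical equivalence automatically. The same relationship can be made explicit in Python, where raw code-point comparison and normalized comparison are separate operations (a sketch using the standard `unicodedata` module):

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# As raw code-point sequences the two strings differ...
assert precomposed != combining

# ...but they are canonically equivalent: normalizing both to NFC
# (or NFD) makes them compare equal, which is what Swift's ==
# does for you behind the scenes.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", combining)
```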
So why is Cyrillic А different from Latin A? UTN #26 gives several reasons. Here are some:
“Traditional graphology has always treated them as distinct scripts, …”
“Literate users of Latin, Greek, and Cyrillic alphabets do not have cultural conventions of treating each other's alphabets and letters as part of their own writing systems.”
“Even more significantly, from the point of view of the problem of character encoding for digital textual representation in information technology, the preexisting identification of Latin, Greek, and Cyrillic as distinct scripts was carried over into character encoding, from the very earliest instances of such encodings.”
“[A] unified encoding of Latin, Greek, and Cyrillic would make casing operations an unholy mess, …”
Read the tech note for full details.
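The script distinction is visible programmatically: the two A's keep distinct code points, distinct character names, and survive every normalization form unchanged. A quick check in Python with the standard `unicodedata` module:

```python
import unicodedata

latin_a, cyrillic_a = "\u0041", "\u0410"

# Visually similar, but distinct code points with distinct names...
assert unicodedata.name(latin_a) == "LATIN CAPITAL LETTER A"
assert unicodedata.name(cyrillic_a) == "CYRILLIC CAPITAL LETTER A"

# ...and no normalization form maps one onto the other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, latin_a) != \
           unicodedata.normalize(form, cyrillic_a)
```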

Related

Is there a module or regex in Python to convert all fonts to a uniform font? (Text is coming from Twitter)

I'm working with some text from Twitter, using Tweepy. All that is fine, and at the moment I'm just looking to start with some basic frequency counts for words. However, I'm running into an issue where users' ability to use different fonts in their tweets makes it look like some words are unique, when in reality they are words that have already been encountered but in a different font/font size, as in the picture below (those words were counted previously and appear earlier in the spreadsheet).
This messes up the accuracy of the counts. I'm wondering if there's a package or general solution to make all the words a uniform font/size - either while I'm tokenizing it (just by hand, not using a module) or while writing it to the csv (using the csv module). Or any other solutions for this that I may not be considering. Thanks!
You can (mostly) solve your problem by normalising your input, using unicodedata.normalize('NFKC', str).
The NFKC normalization form ("Normalization Form KC") first does a "compatibility decomposition" on the text, which replaces Unicode characters that represent style variants, and then does a canonical composition on the result, so that ñ, which is converted to an n and a separate ~ diacritic by the decomposition, is turned back into ñ, the canonical composite for that character. (If you don't want the recomposition step, use NFKD normalization.) See Unicode Annex 15 for a more precise description, with examples.
Unicode contains a number of symbols, mostly used for mathematics, which are simply stylistic variations on some letter or digit. Or, in some cases, on several letters or digits, such as ¼ or ℆. In particular, this includes commonly-used symbols written with font variants which have particular mathematical or other meanings, such as ℒ (the Laplace transform) and ℚ (the set of rational numbers). Compatibility decomposition will strip out the stylistic information, which reduces those four examples to '1⁄4', 'c⁄u', 'L' and 'Q', respectively (the fraction forms use U+2044 FRACTION SLASH rather than an ASCII slash).
The first published Unicode standard defined a Letterlike Symbols block in the Basic Multilingual Plane (BMP). (All of the above examples are drawn from that block.) In Unicode 3.1, complete Latin and Greek alphabets and digits were added in the Mathematical Alphanumeric Symbols block, which includes 13 different font variants of the 52 upper- and lower-case letters of the Roman alphabet, 58 Greek letters in five font variants (some of which could pass for Roman letters, such as 𝝪, which is an upsilon, not a capital Y), and the 10 digits in five variants (𝟎 𝟙 𝟤 𝟯 𝟺), plus a few loose characters which mathematicians apparently asked for.
None of these should be used outside of mathematical typography, but that's not a constraint which most users of social networks care about. So people compensate for the lack of styled text in Twitter (and elsewhere) by using these Unicode characters, despite the fact that they are not properly rendered on all devices, make life difficult for screen readers, cannot readily be searched, and carry all the other disadvantages of hacked typography, such as the issue you are running into. (Some of the rendering problems are also visible in your screenshot.)
Compatibility decomposition can go a long way in resolving the problem, but it also tends to erase information which is really useful. For example, x² and H₂O become just x2 and H2O, which might or might not be what you wanted. But it's probably the best you can do.
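A few concrete checks with `unicodedata` illustrate both the stripping and its limits (note that vulgar fractions decompose with U+2044 FRACTION SLASH, not an ASCII slash):

```python
import unicodedata

# Style variants collapse to plain characters under NFKC...
assert unicodedata.normalize("NFKC", "x\u00b2") == "x2"        # x² -> x2
assert unicodedata.normalize("NFKC", "\u2102") == "C"          # ℂ -> C

# ...vulgar fractions decompose with U+2044 FRACTION SLASH:
assert unicodedata.normalize("NFKC", "\u00bc") == "1\u20444"   # ¼ -> 1⁄4

# NFKD leaves combining marks separate; NFKC recomposes them:
assert unicodedata.normalize("NFKD", "\u00f1") == "n\u0303"    # ñ -> n + combining tilde
assert unicodedata.normalize("NFKC", "n\u0303") == "\u00f1"    # n + combining tilde -> ñ
```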

What is this crazy German character combination to represent an umlaut?

I was just parsing the following website.
There one finds the text
und wären damit auch
At first, the "ä" looks perfectly fine, but once I inspect it, it turns out that this is not the regular "ä" (represented as ascw 228) but this:
ascw: 97, char: a
ascw: 776, char: ¨
I have never before seen an "ä" represented like this.
How can it happen that a website uses this weird character combination and what might be the benefit from it?
What you don't mention in your question is the encoding used. Quite obviously it is a Unicode-based encoding.
In Unicode, code point U+0308 (776 in decimal) is the combining diaeresis. Out of the letter a and the diaeresis, the German character ä is created.
There are indeed two ways to represent German characters with umlauts (ä in this case). Either as a single code point:
U+00E4 latin small letter A with diaeresis
Or as a sequence of two code points:
U+0061 latin small letter A
U+0308 combining diaeresis
Similarly you would combine two code points for an upper case 'Ä':
U+0041 latin capital letter A
U+0308 combining diaeresis
In general, Unicode prefers combining sequences, since far fewer code points need to be assigned to cover the wide range of characters with diacritics. However, for historical reasons (round-trip compatibility with older character sets), precomposed code points exist for letters such as the German umlauts and French accented letters.
The Unicode libraries in most programming languages provide functions to normalize a string, i.e. to either convert all sequences into a single code point where possible, or to expand all single code points into combining sequences. Also see Unicode Normalization Forms.
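The two representations and the conversions between them can be demonstrated with Python's standard `unicodedata` module:

```python
import unicodedata

composed = "\u00e4"     # ä as a single code point (U+00E4)
decomposed = "a\u0308"  # a (U+0061) + COMBINING DIAERESIS (U+0308, 776 decimal)

assert len(composed) == 1 and len(decomposed) == 2

# NFC composes sequences into single code points where possible;
# NFD expands them into combining sequences.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

Text that has passed through a system that decomposes by default (older macOS filesystems, for example) often arrives in the two-code-point form, which is one common way such strings end up on a website.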
Oh my, this was the answer to our original problem with the name of a file upload:
Cannot convert argument 2 to ByteString because the character at index 6 has value 776 which is greater than 255
For future reference.

What's the difference between a character, a code point, a glyph and a grapheme?

Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes - concepts which in the simplest case, when dealing with English text using ASCII characters, all have a one-to-one relationship with each other - is causing me trouble.
Seeing how these terms get used in documents like Matthias Bynens' JavaScript has a unicode problem or Wikipedia's piece on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm kind of struggling to grasp what each term means.
The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:
Abstract Character. A unit of information used for the organization, control, or representation of textual data. ...
...
Character. ... (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. ...
...
Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.
...
Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. ...
Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.
So I seek the arcane wisdom of those more learned than I. How exactly do each of these concepts differ from each other, and in what circumstances would they not have a one-to-one relationship with each other?
Character is an overloaded term that can mean many things.
A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.
A code unit is the unit of storage for an encoded code point. In UTF-8 this means 8 bits, in UTF-16 this means 16 bits. A single code unit may represent a full code point, or part of a code point. For example, the snowman (☃, U+2603) is a single code point but three UTF-8 code units and one UTF-16 code unit.
A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).
A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.
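These distinctions are easy to observe in Python, whose strings are sequences of code points (counting graphemes requires a segmentation library, e.g. the third-party `regex` module's `\X` pattern, since `len` only counts code points):

```python
import unicodedata

s = "\u2603"  # ☃ SNOWMAN
assert len(s) == 1                      # one code point
assert len(s.encode("utf-8")) == 3      # three 8-bit UTF-8 code units
assert len(s.encode("utf-16-le")) == 2  # two bytes = one 16-bit UTF-16 code unit

# One grapheme may be several code points:
a_umlaut = unicodedata.normalize("NFD", "\u00e4")
assert len(a_umlaut) == 2               # two code points, still one grapheme
```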
Outside the Unicode standard a character is an individual unit of text composed of one or more graphemes. What the Unicode standard defines as "characters" is actually a mix of graphemes and characters. Unicode provides rules for the interpretation of juxtaposed graphemes as individual characters.
A Unicode code point is a unique number assigned to each Unicode character (which is either a character or a grapheme).
Unfortunately, the Unicode rules allow some juxtaposed graphemes to be interpreted as other graphemes that already have their own code points (precomposed forms). This means that there is more than one way in Unicode to represent a character. Unicode normalization addresses this issue.
A glyph is the visual representation of a character. A font provides a set of glyphs for a certain set of characters (not Unicode characters). For every character, there is an infinite number of possible glyphs.
A Reply to Mark Amery
First, as I stated, there is an infinite number of possible glyphs for each character so no, a character is not "always represented by a single glyph". Unicode doesn't concern itself much with glyphs, and the things it defines in its code charts are certainly not glyphs. The problem is that neither are they all characters. So what are they?
Which is the greater entity, the grapheme or the character? What does one call those graphic elements in text that are not letters or punctuation? One term that springs quickly to mind is "grapheme". It's a word that precisely conjures up the idea of "a graphical unit in a text". I offer this definition: A grapheme is the smallest distinct component in a written text.
One could go the other way and say that graphemes are composed of characters, but then they would be called "Chinese graphemes", and all those bits and pieces that Chinese graphemes are composed of would have to be called "characters" instead. However, that's all backwards: graphemes are the distinct little bits and pieces; characters are more developed. The phrase "glyphs are composable" would be better stated, in the context of Unicode, as "characters are composable".
Unicode defines characters but it also defines graphemes that are to be composed with other graphemes or characters. Those monstrosities you composed are a fine example of this. If they catch on maybe they'll get their own code points in a later version of Unicode ;)
There's a recursive element to all this. At higher levels, graphemes become characters become graphemes, but it's graphemes all the way down.
A Reply to T S
Chapter 1 of the
standard states: "The Unicode character encoding treats alphabetic characters,
ideographic characters, and symbols equivalently, which means they can be used
in any mixture and with equal facility". Given this statement, we should be
prepared for some conflation of terms in the standard. Sometimes the proper
terminology only becomes clear in retrospect as a standard develops.
It often happens in formal definitions of a language that two fundamental
things are defined in terms of each other. For example, in
XML an element is defined as a starting tag
possibly followed by content, followed by an ending tag. Content is defined in
turn as either an element, character data, or a few other possible things. A
pattern of self-referential definitions is also implicit in the Unicode
standard:
A grapheme is a code point or a character.
A character is composed from a sequence of one or more graphemes.
When first confronted with these two definitions the reader might object to the
first definition on the grounds that a code point is a character, but
that's not always true. A sequence of two code points sometimes encodes a
single code point under
normalization, and that
encoded code point represents the character, as illustrated in figure 2.7
("Sequences of code points that encode other code points"). This is getting a
little tricky, and we haven't even reached the layer where character encoding
schemes such
as UTF-8 are used to
encode code points into byte sequences.
In some contexts, for example a scholarly article on
diacritics, an individual
part of a character might show up in the text by itself. In that context, the
individual character part could be considered a character, so it makes sense
that the Unicode standard remains flexible as well.
As Mark Amery pointed out, a character can be composed into a more complex
thing. That is, each character can serve as a grapheme if desired. The
final result of all composition is a thing that "the user thinks of as a
character". There doesn't seem to be any real resistance, either in the
standard or in this discussion, to the idea that at the highest level there are
these things in the text that the user thinks of as individual characters. To
avoid overloading that term, we can use "grapheme" in all cases where we want
to refer to parts used to compose a character.
At times the Unicode standard is all over the place with its terminology. For
example, Chapter 3
defines UTF-8 as an "encoding form" whereas the glossary defines "encoding
form" as something else, and UTF-8 as a "Character Encoding Scheme". Another
example is "Grapheme_Base" and "Grapheme_Extend", which are
acknowledged to be
mistakes but that persist because purging them is a bit of a task. There is
still work to be done to tighten up the terminology employed by the standard.
The Proposal for addition of COMBINING GRAPHEME
JOINER got it
wrong when it stated that "Graphemes are sequences of one or more encoded
characters that correspond to what users think of as characters." It should
instead read, "A sequence of one or more graphemes composes what the user
thinks of as a character." Then it could use the term "grapheme sequence"
distinctly from the term "character sequence". Both terms are useful.
"grapheme sequence" neatly implies the process of building up a character from
smaller pieces. "character sequence" means what we all typically intuit it to
mean: "A sequence of things the user thinks of as characters."
Sometimes a programmer really does want to operate at the level of grapheme
sequences, so mechanisms to inspect and manipulate those sequences should be
available, but generally, when processing text, it is sufficient to operate on
"character sequences" (what the user thinks of as a character) and let the
system manage the lower-level details.
In every case covered so far in this discussion, it's cleaner to use "grapheme"
to refer to the indivisible components and "character" to refer to the composed
entity. This usage also better reflects the long-established meanings of both
terms.

Haskell ['a'..'z'] for French

I wonder, if this
alph = ['a'..'z']
returns me
"abcdefghijklmnopqrstuvwxyz"
How can I return French alphabet then? Can I pass somehow a locale?
Update:
Well, I know that English and French have the same letters. But my point is: suppose they were not the same, but still started with A and ended with Z. It would be nice to have human-language range support.
After all, some languages do come with localization support.
(just trying Haskell, reading a book)
Haskell Char values are not real characters, they are Unicode code points. In some other languages their native character type may represent other things like ASCII characters or "code page whatsitsnumber" characters, or even something selectable at runtime, but not in Haskell.
The range 'a'..'z' coincides with the English alphabet for historical reasons, both in Unicode and in ASCII, and also in character sets derived from ASCII such as ISO 8859-X. There is no commonly supported coded character set in which some contiguous range of codes coincides with the French alphabet, that is, if you count letters with diacritics as separate letters. The accepted practice seems to be to exclude letters with diacritics, so the French alphabet coincides with the English one, but this is not so for other Latin-derived alphabets.
In order to get most alphabets other than English, one needs to enumerate the characters explicitly by hand rather than with a range expression. For some languages one cannot even use Char to represent all letters, as some of them need more than one code point, such as Hungarian "ly", Spanish "ll" (before 2010) or Dutch "ij" (according to some authorities; there is no single commonly accepted definition).
No language that I know supports arbitrary human alphabets as range expressions out of the box.
While programming languages usually support sorting by the current locale (just search for collate on Hackage), there is no library I know that provides a list of alphabetic characters by locale.
Modern (Unicode) systems that allow for localized characters try to support many non-Latin alphabets as well, and therefore very many alphabetic characters.
Enumerating the alphabetic characters in the Basic Multilingual Plane alone gives over 48,000:
GHCi> length $ filter Data.Char.isAlpha $ map Data.Char.chr [0..256*256]
48408
While I am aware of libraries that allow one to construct alphabetic indices, I don't know of any Haskell binding for this feature.
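The GHCi experiment above can be mirrored in Python; the exact counts depend on the Unicode version the runtime ships, so treat the numbers as approximate:

```python
# Count alphabetic code points, mirroring the GHCi session above.
# Totals vary with the Unicode version bundled with the runtime.
bmp_alpha = sum(1 for i in range(0x10000) if chr(i).isalpha())
all_alpha = sum(1 for i in range(0x110000) if chr(i).isalpha())

print(bmp_alpha)   # ~48k in the Basic Multilingual Plane
print(all_alpha)   # considerably more once CJK extensions etc. are counted
```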

Why do I need a tokenizer for each language? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
When processing text, why would one need a tokenizer specialized for the language?
Wouldn't tokenizing by whitespace be enough? In what cases is it not a good idea to use simple whitespace tokenization?
Tokenization is the identification of linguistically meaningful units (LMU) from the surface text.
Chinese: 如果您在新加坡只能前往一间夜间娱乐场所,Zouk必然是您的不二之选。
English: If you only have time for one club in Singapore, then it simply has to be Zouk.
Indonesian: Jika Anda hanya memiliki waktu untuk satu klub di Singapura, pergilah ke Zouk.
Japanese: シンガポールで一つしかクラブに行く時間がなかったとしたら、このズークに行くべきです。
Korean: 싱가포르에서 클럽 한 군데밖에 갈시간이 없다면, Zouk를 선택하세요.
Vietnamese: Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ ở Singapore thì hãy đến Zouk.
Text Source: http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf
For English, tokenization is simple because each LMU is delimited by whitespace. The same is true for many other languages written in Latin script, such as Indonesian, where whitespace likewise separates most LMUs.
However, sometimes an LMU is a combination of two "words" separated by spaces. E.g. in the Vietnamese sentence above, you have to read thời gian ("time" in English) as one token, not two. Separating the two words into two tokens yields no LMU (e.g. http://vdict.com/th%E1%BB%9Di,2,0,0.html) or the wrong LMU(s) (e.g. http://vdict.com/gian,2,0,0.html). Hence a proper Vietnamese tokenizer would output thời_gian as one token rather than thời and gian.
For some other languages, the orthography has no spaces to delimit "words" or "tokens", e.g. Chinese, Japanese and sometimes Korean. In those cases, tokenization is necessary for the computer to identify LMUs. Often morphemes/inflections are attached to an LMU, so sometimes a morphological analyzer is more useful than a tokenizer in natural language processing.
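For scripts without word delimiters, a classic baseline is greedy longest-match ("MaxMatch") segmentation against a lexicon. A minimal sketch, with an invented mini-lexicon for illustration (real segmenters use large dictionaries plus statistical models):

```python
def max_match(text, lexicon):
    """Greedy longest-match segmentation; falls back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate substring first, shrinking toward
        # a single character, which is always accepted as a fallback.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

lexicon = {"新加坡", "夜间", "娱乐场所"}  # hypothetical mini-lexicon
print(max_match("新加坡夜间娱乐场所", lexicon))
# -> ['新加坡', '夜间', '娱乐场所']
```

With an empty lexicon the function degrades to character-by-character splitting, which is itself a (weak) baseline for Chinese.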
Some languages, like Chinese, don't use whitespace to separate words at all.
Other languages will use punctuation differently - an apostrophe might or might not be a part of a word, for instance.
Case-folding rules vary from language to language.
Stopwords and stemming differ between languages (though I guess I'm straying from tokenizers to analyzers here).
Edit by Bjerva: Additionally, many languages concatenate compound nouns. Whether these should be tokenised into several tokens cannot be easily determined using only whitespace.
The question also implies "What is a word?" and can be quite task-specific (even disregarding multilinguality as one parameter). Here's my try of a subsuming answer:
(Missing) Spaces between words
Many languages do not put spaces between words at all, so the
basic word-division algorithm of breaking on whitespace is of no use.
Such languages include the major East Asian languages/scripts,
such as Chinese, Japanese, and Thai. Ancient Greek was also written
without word spaces; spaces were introduced (together with accent
marks, etc.) by later editors. In such languages, word segmentation
is a far greater and more challenging task. (MANNI:1999, p. 129)
Compounds
German compound nouns are written as a single word, e.g.
"Kartellaufsichtsbehördenangestellter" (an employee at the anti-trust
agency), and compounds are de facto single words, phonologically (cf. MANNI:1999, p. 120).
Their information density, however, is high, and one may wish to
split such a compound, or at least to be aware of the internal
structure of the word; this becomes a limited word-segmentation
task. (Ibidem)
There is also the special case of agglutinative languages, where prepositions, possessive pronouns, etc. are 'attached' to the 'main' word; e.g. Finnish, Hungarian, and Turkish in the European domain.
Variant styles and codings
Variant coding of information of a certain semantic type, e.g. local syntax for phone numbers, dates, etc.:
[...] Even if one is not dealing with multilingual text, any
application dealing with text from different countries or written
according to different stylistic conventions has to be prepared to
deal with typographical differences. In particular, some items such as
phone numbers are clearly of one semantic sort, but can appear in many
formats. (MANNI:1999, p. 130)
Misc.
One major task is the disambiguation of periods (or interpunctuation in general) and other non-alpha(-numeric) symbols: if e.g. a period is part of the word, keep it that way, so we can distinguish Wash., an abbreviation for the state of Washington, from the capitalized form of the verb wash (MANNI:1999, p. 129). Besides cases like this, handling contractions and hyphenation also cannot be treated as a cross-language standard case (even disregarding the missing whitespace separator).
If one wants to handle multilingual contractions/clitics:
English: They're my father's cousins.
French: Montrez-le à l'agent !
German: Ich hab's ins Haus gebracht. (in's is still a valid variant)
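Even for English, whitespace splitting is not enough once clitics are involved. A minimal regex sketch (the pattern is illustrative only, not a production tokenizer):

```python
import re

# Try the "are|n't" split first, then n't itself, then 'xx clitics,
# then runs of ordinary word characters.
CLITIC = re.compile(r"\w+(?=n't)|n't|'\w+|\w+")

print(CLITIC.findall("They're my father's cousins, aren't they?"))
# -> ['They', "'re", 'my', 'father', "'s", 'cousins', 'are', "n't", 'they']
```

Note how "aren't" splits into "are" + "n't" while possessives and contractions keep their apostrophes attached; a whitespace tokenizer would leave "They're" and "father's" as opaque single tokens.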
Since tokenization and sentence segmentation go hand in hand, they share the same (cross-language) problems. To whom it may concern/wants a general direction:
Kiss, Tibor and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), p. 485-525.
Palmer, D. and M. Hearst. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics, 23(2), p. 241-267.
Reynar, J. and A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. Proceedings of the Fifth Conference on Applied Natural Language Processing, p. 16-19.
References
(MANNI:1999) Manning Ch. D., H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.