Why does Unicode allow symbols to extend beyond the text frame?

For example, take these Javanese symbols (꧁, ꧅, ꧂ and a few others):
test
꧅test
They touch the next line and extend beyond the symbol frame. Unlike Zalgo characters, which consist of a symbol plus a lot of combining diacritics, each of these is a single symbol.
My question is: why is this so? Why does Unicode allow single symbols to do that, and is there a list of such symbols (or a special property responsible for this)?
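One way to check the "single symbol" observation is to inspect the characters with Python's standard unicodedata module. The sketch below only shows that each symbol is one code point with combining class 0, in contrast to a Zalgo-style sequence; it says nothing about how tall a font chooses to draw the glyph. The Zalgo sample string is an assumption made up for illustration.

```python
import unicodedata

# The three Javanese symbols from the question: U+A9C1, U+A9C5, U+A9C2.
symbols = "\uA9C1\uA9C5\uA9C2"

for ch in symbols:
    # One code point each; combining class 0 means it is not a combining mark.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.category(ch), unicodedata.combining(ch))

# A Zalgo-style string, by contrast, is a base letter plus combining marks.
zalgo = "t\u0301\u0302\u0303"
marks = [c for c in zalgo if unicodedata.combining(c) != 0]
print(len(zalgo), "code points, of which", len(marks), "are combining marks")
```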

Related

Is there a module or regex in Python to convert all fonts to a uniform font? (Text is coming from Twitter)

I'm working with some text from twitter, using Tweepy. All that is fine, and at the moment I'm just looking to start with some basic frequency counts for words. However, I'm running into an issue where the ability of users to use different fonts for their tweets is making it look like some words are their own unique word, when in reality they're words that have already been encountered but in a different font/font size, like in the picture below (those words are words that were counted previously and appear in the spreadsheet earlier up).
This messes up the accuracy of the counts. I'm wondering if there's a package or general solution to make all the words a uniform font/size - either while I'm tokenizing it (just by hand, not using a module) or while writing it to the csv (using the csv module). Or any other solutions for this that I may not be considering. Thanks!
You can (mostly) solve your problem by normalising your input, using unicodedata.normalize('NFKC', str).
The NFKC normalization form (NF stands for "normalization form") first does a "compatibility decomposition" on the text, which replaces Unicode characters which represent style variants, and then does a canonical composition on the result, so that ñ, which is converted to an n and a separate ~ diacritic by the decomposition, is then turned back into an ñ, the canonical composite for that character. (If you don't want the recomposition step, use NFKD normalisation.) See Unicode Annex 15 for a more precise description, with examples.
Unicode contains a number of symbols, mostly used for mathematics, which are simply stylistic variations on some letter or digit. Or, in some cases, on several letters or digits, such as ¼ or ℆. In particular, this includes commonly-used symbols written with font variants which have particular mathematical or other meanings, such as ℒ (the Laplace transform) and ℚ (the set of rational numbers). Compatibility decomposition will strip out the stylistic information, which reduces those four examples to '1⁄4' (with U+2044 FRACTION SLASH rather than the ASCII slash), 'c/u', 'L' and 'Q', respectively.
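As a small sketch of what this looks like in practice, the snippet below runs the four symbols above through NFKC and compares NFKC with NFKD on ñ. The variable names are just for illustration. Note that ¼ comes out with U+2044 FRACTION SLASH between the digits, not the ASCII slash.

```python
import unicodedata

# U+00BC (¼), U+2106 (℆), U+2112 (ℒ), U+211A (ℚ)
samples = ["\u00bc", "\u2106", "\u2112", "\u211a"]
results = {s: unicodedata.normalize("NFKC", s) for s in samples}
for original, folded in results.items():
    print(f"{original} -> {folded!r}")

# NFKC recomposes ñ into one code point; NFKD leaves n + combining tilde.
nfkc = unicodedata.normalize("NFKC", "\u00f1")
nfkd = unicodedata.normalize("NFKD", "\u00f1")
print(len(nfkc), len(nfkd))
```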
The first published Unicode standard defined a Letterlike Symbols block in the Basic Multilingual Plane (BMP). (All of the above examples are drawn from that block.) In Unicode 3.1, complete Latin and Greek alphabets and digits were added in the Mathematical Alphanumeric Symbols block, which includes 13 different font variants of the 52 upper- and lower-case letters of the roman alphabet, 58 Greek letters in five font variants (some of which could pass for roman letters, such as 𝝪, which is upsilon, not capital Y), and the 10 digits in five variants (𝟎 𝟙 𝟤 𝟯 𝟺). And a few loose characters which mathematicians apparently asked for.
None of these should be used outside of mathematical typography, but that's not a constraint which most users of social networks care about. So people compensate for the lack of styled text in Twitter (and elsewhere) by using these Unicode characters, despite the fact that they are not properly rendered on all devices, make life difficult for screen readers, cannot readily be searched, and all the other disadvantages of using hacked typography, such as the issue you are running into. (Some of the rendering problems are also visible in your screenshot.)
Compatibility decomposition can go a long way in resolving the problem, but it also tends to erase information which is really useful. For example, x² and H₂O become just x2 and H2O, which might or might not be what you wanted. But it's probably the best you can do.
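For the frequency counts in the question, a minimal sketch might look like the following. The helper name normalise_token and the sample tokens are made up for illustration, and the casefold() step is an extra assumption (it merges "Hello" with "hello"), not something NFKC does by itself.

```python
import unicodedata
from collections import Counter

def normalise_token(token):
    # NFKC folds styled mathematical letters to plain ones;
    # casefold() additionally ignores case (an optional extra step).
    return unicodedata.normalize("NFKC", token).casefold()

# Build "hello" in Mathematical Bold (U+1D41A is MATHEMATICAL BOLD SMALL A).
bold_hello = "".join(chr(0x1D41A + ord(c) - ord("a")) for c in "hello")

# "x\u00b2" is x followed by SUPERSCRIPT TWO.
tokens = ["hello", bold_hello, "Hello", "x\u00b2"]
counts = Counter(normalise_token(t) for t in tokens)
print(counts)
```

The last token shows the information loss mentioned above: the superscript is folded to a plain digit, so x² is counted as x2.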

Read substrings from a string containing multiplication [duplicate]

This question already has answers here:
'*' and '/' not recognized on input by a read statement
I am a scientist programming in Fortran, and I came across a strange behaviour. In one of my programs I have a string containing several "words", and I want to read all words as substrings. The first word starts with an integer and a wildcard, like "2*something".
When I perform an internal read on that string, I expect to read all words, but instead, the READ statement repeatedly reads the first substring. I do not understand why, nor how to avoid this behaviour.
Below is a minimalist sample program that reproduces this behaviour. I would expect it to read the three substrings and to print "3*a b c" on the screen. Instead, I get "a a a".
What am I doing wrong? Can you please help me and explain what is going on?
I am compiling my programs under GNU/Linux x64 with Gfortran 7.3 (7.3.0-27ubuntu1~18.04).
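PROGRAM testread
  IMPLICIT NONE
  CHARACTER(LEN=1024) :: string
  CHARACTER(LEN=16) :: v1, v2, v3
  ! The first word starts with an integer followed by an asterisk
  string = "3*a b c"
  ! Internal list-directed read of three words from the string
  READ(string,*) v1, v2, v3
  PRINT*, v1, v2, v3
END PROGRAM testread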
You are using list-directed input (the * format specifier). In list-directed input, a number (n) followed by an asterisk means "repeat this item n times", so it is processed as if the input was a a a b c. You would need to have as input '3*a' b c to get what you want.
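To make the repeat-count rule concrete, here is a rough model of it in Python. This is only an illustrative sketch, not Fortran and not how any compiler implements list-directed input: the function name expand_list_directed is made up, and it deliberately ignores quoted strings, null values, and the slash terminator.

```python
import re

def expand_list_directed(record):
    """Model only the r*c repeat-count rule of list-directed input."""
    values = []
    # Blanks and commas both separate values in this simplified model.
    for token in record.replace(",", " ").split():
        match = re.fullmatch(r"(\d+)\*(.*)", token)
        if match:
            # "3*a" stands for the value "a" repeated 3 times.
            values.extend([match.group(2)] * int(match.group(1)))
        else:
            values.append(token)
    return values

values = expand_list_directed("3*a b c")
v1, v2, v3 = values[:3]   # the READ asks for three items only
print(values)
print(v1, v2, v3)
```

In this model the three requested items are all taken from the expanded repeat, which matches the output reported in the question.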
I will use this as another opportunity to point out that list-directed I/O is sometimes the wrong choice as its inherent flexibility may not be what you want. That it has rules for things like repeat counts, null values, and undelimited strings is often a surprise to programmers. I also often see programmers complaining that list-directed input did not give an error when expected, because the compiler had an extension or the programmer didn't understand just how liberal the feature can be.
I suggest you pick up a Fortran language reference and carefully read the section on list-directed I/O. You may find you need to use an explicit format or change your program's expectations.
Following the answer of @SteveLionel, here is the relevant part of the reference on list-directed sequential READ statements (in this case, for Intel Fortran, but you could find it for your specific compiler and it won't be much different).
A character string does not need delimiting apostrophes or quotation marks if the corresponding I/O list item is of type default character, and the following is true:
The character string does not contain a blank, comma (,), or slash ( / ).
The character string is not continued across a record boundary.
The first nonblank character in the string is not an apostrophe or a quotation mark.
The leading character is not a string of digits followed by an asterisk.
A nondelimited character string is terminated by the first blank, comma, slash, or end-of-record encountered. Apostrophes and quotation marks within nondelimited character strings are transferred as is.
In total, there are 4 forms of sequential read statements in Fortran, and you may choose the option that best fits your need:
Formatted Sequential Read:
To use this, you change the * to an actual format specifier. If you know the length of the strings in advance, this would be as easy as '(a3,a2,a2)'. Or you could come up with a format specifier that matches your data, but this generally requires knowing the length or format of your input.
Formatted Sequential List-Directed:
You are currently using this option (the * format descriptor). As we already showed you, this kind of I/O comes with a lot of magic and surprising behavior. What is hitting you is the n*c form, which is interpreted as n repetitions of the constant c.
As said by Steve Lionel, you could put quotation marks around the problematic word, so it will be parsed as one piece. Or, as proposed by @evets, you could split or break your string using the intrinsics index or scan. Another option could be changing your wildcard from an asterisk to anything else.
Formatted Namelist:
Well, that could be an option if your data was (or could be) presented in the namelist format, but I really think it's not your case.
Unformatted:
This may not apply to your case because you are reading from a character variable, and an internal READ statement can only be formatted.
Otherwise, you could split your string by means of a function instead of an I/O operation. There is no intrinsic for this, but you could come up with one without much trouble (see this thread for reference). As you may have noted already, manipulating strings in Fortran is... awkward, at least. There are some libraries out there (like this) that may be useful if you are doing lots of string stuff in Fortran.

Heuristic for using symbols rather than strings in J

While the real reason to use J's symbols (s: ' Abe Bill Chad') rather than string arrays ('Abe','Bill',:'Chad') or boxed lists of strings ('Abe';'Bill';'Chad') is that it is the best solution (most efficient/convenient for man or machine), what is the rule of thumb for when to use symbols?
The vocabulary page for s: mentions efficient "searching, sorting, and comparisons." Do you start using symbols from the beginning if there's any chance you'll be searching, sorting, or comparing? Or do you only use symbols once you recognize that your code is working around the limitations of the other options? Or something a bit more nuanced in between?
My experience is that for most use cases symbols (s:) are not necessary to get acceptable performance. This matches the advice in the section on Symbol in J for C Programmers that you use symbols if you find your program taking a lot of time matching strings. It also warns that:
there is no way to tell the interpreter to free the resources for a
single string; this can be a problem if your symbol table is large and
changes dynamically.
For this reason the Vocabulary page for Symbol on JWiki suggests symbols are most useful when:
a limited number of strings appear repeatedly
the set of symbols is known and unchanging

What's the difference between a character, a code point, a glyph and a grapheme?

Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes - concepts which in the simplest case, when dealing with English text using ASCII characters, all have a one-to-one relationship with each other - is causing me trouble.
Seeing how these terms get used in documents like Mathias Bynens' JavaScript has a Unicode problem or Wikipedia's piece on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm kind of struggling to grasp what each term means.
The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:
Abstract Character. A unit of information used for the organization, control, or representation of textual data. ...
...
Character. ... (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. ...
...
Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.
...
Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. ...
Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.
So I seek the arcane wisdom of those more learned than I. How exactly do each of these concepts differ from each other, and in what circumstances would they not have a one-to-one relationship with each other?
Character is an overloaded term that can mean many things.
A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.
A code unit is the unit of storage of a part of an encoded code point. In UTF-8 this means 8 bits, in UTF-16 this means 16 bits. A single code unit may represent a full code point, or part of a code point. For example, the snowman glyph (☃) is a single code point but 3 UTF-8 code units, and 1 UTF-16 code unit.
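The snowman example can be checked directly in Python by encoding the string and counting code units. This is a small sketch; the second character (U+1F600, outside the BMP) is an extra example not taken from the answer, added to show a code point that needs two UTF-16 code units.

```python
snowman = "\u2603"  # SNOWMAN, a single code point

utf8_units = len(snowman.encode("utf-8"))            # 8-bit code units
utf16_units = len(snowman.encode("utf-16-le")) // 2  # 16-bit code units
print(len(snowman), utf8_units, utf16_units)

# A code point outside the BMP needs a surrogate pair in UTF-16.
emoji = "\U0001F600"
emoji_utf8_units = len(emoji.encode("utf-8"))
emoji_utf16_units = len(emoji.encode("utf-16-le")) // 2
print(len(emoji), emoji_utf8_units, emoji_utf16_units)
```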
A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).
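The two spellings of ä can be compared in Python, where len() counts code points rather than graphemes. A minimal sketch, with variable names chosen only for illustration:

```python
import unicodedata

precomposed = "\u00e4"   # ä as one code point
decomposed = "a\u0308"   # a followed by COMBINING DIAERESIS

# Same grapheme on screen, but different code point sequences.
same_text = precomposed == decomposed
print(len(precomposed), len(decomposed), same_text)

# Canonical normalisation converts between the two spellings.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
print(nfc_equal, nfd_equal)
```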
A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.
Outside the Unicode standard a character is an individual unit of text composed of one or more graphemes. What the Unicode standard defines as "characters" is actually a mix of graphemes and characters. Unicode provides rules for the interpretation of juxtaposed graphemes as individual characters.
A Unicode code point is a unique number assigned to each Unicode character (which is either a character or a grapheme).
Unfortunately, the Unicode rules allow some juxtaposed graphemes to be interpreted as other graphemes that already have their own code points (precomposed forms). This means that there is more than one way in Unicode to represent a character. Unicode normalization addresses this issue.
A glyph is the visual representation of a character. A font provides a set of glyphs for a certain set of characters (not Unicode characters). For every character, there is an infinite number of possible glyphs.
A Reply to Mark Amery
First, as I stated, there is an infinite number of possible glyphs for each character so no, a character is not "always represented by a single glyph". Unicode doesn't concern itself much with glyphs, and the things it defines in its code charts are certainly not glyphs. The problem is that neither are they all characters. So what are they?
Which is the greater entity, the grapheme or the character? What does one call those graphic elements in text that are not letters or punctuation? One term that springs quickly to mind is "grapheme". It's a word that precisely conjures up the idea of "a graphical unit in a text". I offer this definition: A grapheme is the smallest distinct component in a written text.
One could go the other way and say that graphemes are composed of characters, but then they would be called "Chinese graphemes", and all those bits and pieces Chinese graphemes are composed of would have to be called "characters" instead. However, that's all backwards. Graphemes are the distinct little bits and pieces. Characters are more developed. The phrase "glyphs are composable", would be better stated in the context of Unicode as "characters are composable".
Unicode defines characters but it also defines graphemes that are to be composed with other graphemes or characters. Those monstrosities you composed are a fine example of this. If they catch on maybe they'll get their own code points in a later version of Unicode ;)
There's a recursive element to all this. At higher levels, graphemes become characters become graphemes, but it's graphemes all the way down.
A Reply to T S
Chapter 1 of the
standard states: "The Unicode character encoding treats alphabetic characters,
ideographic characters, and symbols equivalently, which means they can be used
in any mixture and with equal facility". Given this statement, we should be
prepared for some conflation of terms in the standard. Sometimes the proper
terminology only becomes clear in retrospect as a standard develops.
It often happens in formal definitions of a language that two fundamental
things are defined in terms of each other. For example, in
XML an element is defined as a starting tag
possibly followed by content, followed by an ending tag. Content is defined in
turn as either an element, character data, or a few other possible things. A
pattern of self-referential definitions is also implicit in the Unicode
standard:
A grapheme is a code point or a character.
A character is composed from a sequence of one or more graphemes.
When first confronted with these two definitions the reader might object to the
first definition on the grounds that a code point is a character, but
that's not always true. A sequence of two code points sometimes encodes a
single code point under
normalization, and that
encoded code point represents the character, as illustrated in
figure 2.7. Sequences of
code points that encode other code points. This is getting a little tricky and
we haven't even reached the layer where character encoding schemes such
as UTF-8 are used to
encode code points into byte sequences.
In some contexts, for example a scholarly article on
diacritics, an individual
part of a character might show up in the text by itself. In that context, the
individual character part could be considered a character, so it makes sense
that the Unicode standard remain flexible as well.
As Mark Amery pointed out, a character can be composed into a more complex
thing. That is, each character can serve as a grapheme if desired. The
final result of all composition is a thing that "the user thinks of as a
character". There doesn't seem to be any real resistance, either in the
standard or in this discussion, to the idea that at the highest level there are
these things in the text that the user thinks of as individual characters. To
avoid overloading that term, we can use "grapheme" in all cases where we want
to refer to parts used to compose a character.
At times the Unicode standard is all over the place with its terminology. For
example, Chapter 3
defines UTF-8 as an "encoding form" whereas the glossary defines "encoding
form" as something else, and UTF-8 as a "Character Encoding Scheme". Another
example is "Grapheme_Base" and "Grapheme_Extend", which are
acknowledged to be
mistakes but that persist because purging them is a bit of a task. There is
still work to be done to tighten up the terminology employed by the standard.
The Proposal for addition of COMBINING GRAPHEME
JOINER got it
wrong when it stated that "Graphemes are sequences of one or more encoded
characters that correspond to what users think of as characters." It should
instead read, "A sequence of one or more graphemes composes what the user
thinks of as a character." Then it could use the term "grapheme sequence"
distinctly from the term "character sequence". Both terms are useful.
"grapheme sequence" neatly implies the process of building up a character from
smaller pieces. "character sequence" means what we all typically intuit it to
mean: "A sequence of things the user thinks of as characters."
Sometimes a programmer really does want to operate at the level of grapheme
sequences, so mechanisms to inspect and manipulate those sequences should be
available, but generally, when processing text, it is sufficient to operate on
"character sequences" (what the user thinks of as a character) and let the
system manage the lower-level details.
In every case covered so far in this discussion, it's cleaner to use "grapheme"
to refer to the indivisible components and "character" to refer to the composed
entity. This usage also better reflects the long-established meanings of both
terms.

Is there a unicode range that is a copy of the first 128 characters?

I would like to be able to put and other characters into a text without it being interpreted by the computer. So was wondering is there a range that is defined as mapping to the same glyphs etc as the range 0-0x7f (the ascii range).
Please note I state that the range 0-0x7f is the same as ascii, so the question is not what range maps to ascii.
I am asking whether there is another range that also maps to the same glyphs, i.e. one that looks the same when rendered but is seen as a different code when interpreted.
so I can write
print "hello "world""
characters in bold avoid the 0-0x7f (ascii range)
Additional:
I meant homographic and behaviourally identical: well, everything the same except a different code point. I was hoping for the whole ASCII (128-character) set, directly mapped (an offset added to them all).
The reason: to avoid interpretation by any language that uses some of the ASCII characters as part of its language but allows any Unicode character in literal strings, e.g. (when UTF-8 encoded) C, HTML, CSS, …
I was trying to retro-fix the idea of “no reserved words” / “word colours” (string literals one colour, keywords another, variables another, numbers another, etc) so that a string literal or variable-name(though not in this case) can contain any character.
I interpret the question to mean "is there a set of code points which are homographic with the low 7-bit ASCII set". The answer is no.
There are some code points which are conventionally rendered homographically (e.g. Cyrillic uppercase А U+0410 looks identical to ASCII 65 in many fonts, and quite similar in most fonts which support this code point) but they are different code points with different semantics. Similarly, there are some code points which basically render identically, but have a specific set of semantics, like the non-breaking space U+00A0 which renders identically to ASCII 32 but is specified as having a particular line-breaking property; or the RIGHT SINGLE QUOTATION MARK U+2019 which is an unambiguous quotation mark, as opposed to its twin ASCII 39, the "apostrophe".
But in summary, there are many symbols in the basic ASCII block which do not coincide with a homograph in another code block. You might be able to find homographs or near-homographs for your sample sentence, though; I would investigate the IPA phonetic symbols and the Greek and Cyrillic blocks.
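The look-alike pairs mentioned above can be told apart by their code points and names, for instance with Python's unicodedata module. A small sketch; the list of pairs simply repeats the three examples from this answer:

```python
import unicodedata

# (ASCII character, look-alike) pairs: Cyrillic A, no-break space,
# right single quotation mark.
pairs = [("A", "\u0410"), (" ", "\u00a0"), ("'", "\u2019")]

for ascii_char, lookalike in pairs:
    print(f"U+{ord(ascii_char):04X} {unicodedata.name(ascii_char)}  vs  "
          f"U+{ord(lookalike):04X} {unicodedata.name(lookalike)}")

# However similar they render, the strings never compare equal.
all_distinct = all(a != b for a, b in pairs)
print(all_distinct)
```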
The answer to the question asked is “No”, as @tripleee described, but the following note might be relevant if the purpose is trickery or fun of some kind:
The printable ASCII characters excluding the space have been duplicated at U+FF01 to U+FF5E, but these are fullwidth characters intended for use in CJK texts. Their shape is (and is meant to be) different: ｈｅｌｌｏ　ｗｏｒｌｄ. (Your browser may be unable to render them.) So they are not really homographic with ASCII characters but could be used for some special purposes. (I have no idea of what the purpose might be here.)
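Because the fullwidth forms sit at a fixed offset from their ASCII counterparts, the mapping is easy to sketch in Python. The helper name to_fullwidth and the sample string are made up for illustration; the round trip back to ASCII relies on NFKC compatibility normalisation.

```python
import unicodedata

FULLWIDTH_OFFSET = 0xFF01 - 0x21   # 0xFEE0

def to_fullwidth(text):
    # Shift printable ASCII except space (U+0021..U+007E) into
    # U+FF01..U+FF5E; every other character is left unchanged.
    return "".join(
        chr(ord(c) + FULLWIDTH_OFFSET) if 0x21 <= ord(c) <= 0x7E else c
        for c in text
    )

original = 'print "hello"'
wide = to_fullwidth(original)
back = unicodedata.normalize("NFKC", wide)
print(wide)
print(back == original)
```

A compiler or parser would not treat the fullwidth quotation mark as a string delimiter, but the text also no longer looks like ordinary ASCII.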
It depends on the Unicode encoding you use.
In UTF-8, the first 128 code points are encoded as single bytes with exactly the same values as their ASCII counterparts. In UTF-16, those same 128 characters are the code units 0x0000 to 0x007F (2 bytes each).

Resources