Letter-only collation (was: Weird file ordering in Emacs dired with my locale) - locale

I just noticed. And this is creepy. But here's my screenshot. So help me, maybe!
TL;DR
The question's at the bottom.
Symptom
-rw-r--r-- 1 jb jb 24287 mars 21 2012 array.c
-rw-r--r-- 1 jb jb 28767 oct. 1 2014 arrayfunc.c
-rw-r--r-- 1 jb jb 2895 mai 11 2012 arrayfunc.h
-rw-rw-r-- 1 jb jb 4030 mars 29 2009 array.h
-UUU:%%--F1 bash-4.3.30 6% L9 (Dired by name)---------------------
 
(This is an emacs -nw screenshot. Yes, my terminal is 6 lines tall. It makes the screenshots more to-the-point. The locale is French, and that's expected. It's not that different to English, just imagine there's a “may” instead of « mai » and the months are Capitalized and truncated to three characters)
In case you missed it, it's dired mode, the files are supposed to be sorted by name (says so in the modeline) yet array.c and array.h aren't together!
Panic
I was looking for array.c, had the cursor beneath so whoa dude where is it it was there a minute ago. Then I actually find it. Then I check the modeline. Then I go WTF I'm asking SO. Then I notice it's in French they'll never understand better take a new screenshot with LC_ALL=C.
But that fixed the problem.
(Yes, it really happened.)
So it's a locale thing
My locale is fr_FR.UTF-8
$ ls ar* | $ LC_ALL=C ls ar*
array.c | array.c
arrayfunc.c | array.h
arrayfunc.h | arrayfunc.c
array.h | arrayfunc.h
(That's when I remove the emacs tag and start wondering if anyone actually follows collation seriously)
Seems it's the norm
I'll spare you the arcane shell invocations, but the gist of it: of the 29 locales I've got installed here, all but three use the “weird” ordering. Those three are: C, C.UTF-8 and POSIX.
It goes without saying, but there's no harm in mentioning it anyway: the “weird” ordering disturbs me, but it makes sense in its own way: on this small sample set it orders lexicographically as usual, only ignoring the period. So arrayc < arrayf < arrayh.
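For anyone who wants to see the two orderings side by side, here is a rough sketch. It only imitates what I observe on this sample; it is not how the locale machinery actually collates. One sort compares raw code points (which matches what LC_ALL=C gives for these ASCII names), the other uses a key that throws away everything that is not a letter or digit. The helper name `letters_only` is made up for the illustration.

```python
# Sketch only: imitates the observed orderings, not the real locale tables.
files = ["array.c", "arrayfunc.c", "arrayfunc.h", "array.h"]

def letters_only(name):
    """Hypothetical key: drop punctuation and spaces, keep letters and digits."""
    return "".join(ch for ch in name if ch.isalnum())

c_order = sorted(files)                         # plain code point comparison
weird_order = sorted(files, key=letters_only)   # the period is ignored

print(c_order)
print(weird_order)
```

With the key, array.h is compared as "arrayh", which is why it lands after both arrayfunc files.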
Question
Why? WHY? WHY??? It's in every locale but C, so it's deliberate. What rule is this based on? Did someone in some committee erect and convict: “thou shalt not observe thy punctuation whilst collating”? There's probably some legitimate serious document where they say it's perfectly normal, here's why, right?
It's the first time in oh so many years that I notice.
It also ignores spaces, of course.
Bonus: It's the bash-4.3.30 tarball from gnu.org. Why are some files 0664 and others 0644? Keep answers to that in the comments.
Also: I'm not asking how to fix it. In case you hadn't noticed, I don't really need to fix it. Plus, this has dupes everywhere. What I'm asking is why.

ANSWER: The Unicode Consortium came to the conclusion that having a guaranteed sort order, regardless of 'variable' characters, was more important than including every character in the string.
DETAILS: I believe the answer you're looking for resides in:
Unicode Technical Standard #10: Unicode Collation Algorithm
If I'm understanding it correctly, punctuation (among other things, like whitespace) is 'variable' among languages, and therefore to ensure an identical sort order across languages, 'variable' characters are given a very low 'weight' in sorting; frequently resolving to a weight of zero, and therefore having no effect on sorting at all.
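To make the "very low weight" idea concrete, here is a minimal sketch. It is not the actual Unicode Collation Algorithm, just a two-level key loosely modeled on it: the first level skips the characters treated as 'variable', and the untouched string is consulted only to break ties. The set `VARIABLE` (ASCII punctuation plus whitespace) and the function `collation_key` are assumptions made for this example.

```python
import string

# Assumption for illustration: treat ASCII punctuation and whitespace as 'variable'.
VARIABLE = set(string.punctuation) | set(string.whitespace)

def collation_key(s):
    """Two-level key: non-variable characters first, full string as tie-breaker."""
    primary = "".join(ch for ch in s if ch not in VARIABLE)
    return (primary, s)

words = ["ab", "a b", "a-b", "ac", "aa"]
by_code_point = sorted(words)
by_key = sorted(words, key=collation_key)

print(by_code_point)
print(by_key)
```

In the second result, "a b", "a-b" and "ab" all share the first-level key "ab", so they stay together between "aa" and "ac"; the space and hyphen only decide the order inside that group.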
The UTS does indicate that the sorting can be customized per user.
Unfortunately, most systems just go with the defaults, which leads to only a few collation definitions that give 'variable' characters equal weight; and no real support for users to tune the defaults so that they get UTF-8 sorting with punctuation and whitespace INCLUDED instead of EXCLUDED.
If I follow the rationale correctly, consider sorting names. In many cultures and languages, firstname is always given before lastname, and when reversed, the lastname is separated by punctuation from the firstname. In other cultures, the reverse is true.
lastname, firstname
lastname firstname
and
firstname lastname
firstname, lastname
To ensure that each list is always sorted in the same order, the punctuation is ignored.

Related

my bashrc contains strange characters (if Ä -f ü/.bash_aliases Å; then . ü/.bash_aliases fi)

On a Linux instance in GCP Compute, I accidentally did cat filebeat instead of filebeat.yaml.
After that, my bashrc contains the characters below, and if I type '~', bash prints 'ü'.
I need help fixing this.
if Ä -f ü/.bash_aliases Å; then
. ü/.bash_aliases
fi
This looks like your terminal was accidentally configured for legacy ISO-646-SE or a variant. Your file is probably fine; it's just that your terminal remaps the display characters according to a scheme from the 1980s.
A quick hex dump should verify that the characters in the file are actually correct. Here's an example of what you should see.
bash$ echo '[\]' | xxd
00000000: 5b5c 5d0a [\].
Even if the characters are displayed as ÄÖÅ, they are correct if you see the hex codes 5B, 5C, and 5D. (If you don't have xxd, try hexdump or od -t x1.)
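If it helps to see the distinction between what is stored and what is drawn, here is a small sketch. The remapping table contains only the four substitutions visible in this question ([, \, ] and ~); a real national ISO-646 variant remaps more positions, and the name `DISPLAY_MAP` is just for illustration.

```python
# Display remapping taken only from the characters shown in the question;
# a real ISO-646 national variant remaps more positions than these four.
DISPLAY_MAP = str.maketrans({"[": "Ä", "\\": "Ö", "]": "Å", "~": "ü"})

line = "if [ -f ~/.bash_aliases ]; then"
stored = line.encode("ascii")          # the bytes that are actually in the file
shown = line.translate(DISPLAY_MAP)    # what the misconfigured terminal draws

print(stored.hex(" "))   # byte values, unaffected by how the terminal displays them
print(shown)
```

The bytes never change; only the glyph chosen for each byte does, which is why a hex dump is a reliable check.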
Probably
bash$ tput reset
can set your terminal back to sane settings. Maybe stty sane might work too (but less likely, in my experience). Else, try logging out and back in.
Back when ASCII was the only game in town, but American (or really any) hardware was exported to places where the character repertoire was insufficient, the local vendor would replace the ROM chips in terminals to remap some slightly less common character codes to be displayed as the missing local glyphs. Over time, this became standardized; the ISO-646 standard was updated to document these local overrides. (The linked Wikipedia page has a number of tables with details.)
Eventually, 8-bit character sets became the norm, and then most locales switched to Latin-1 or some other suitable character set which no longer needed this hack. However, it was still rather prevalent even in the early 1990s. In the early 2000s, Unicode started taking over, and so now this seems like an absurd arrangement.
I'm guessing the file you happened to cat contained some control characters which instructed your terminal to switch to this legacy character set. It's not entirely uncommon (though usually when it happens to me, it switches to some "graphical" character set where some characters display box-drawing characters or mathematical symbols).

Why does sublime consider <!------- (multiple dashes) a syntax error

I have a .html file that is working perfectly fine but for some reason Sublime 3 decides that it has invalid code, check the image below:
Any idea why that's happening and how to fix it without having to modify the code?
The HTML5 spec states (my emphasis):
Comments must start with the four character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--). Following this sequence, the comment may have text, with the additional restriction that the text must not start with a single > (U+003E) character, nor start with a U+002D HYPHEN-MINUS character (-) followed by a > (U+003E) character, nor contain two consecutive U+002D HYPHEN-MINUS characters (--), nor end with a U+002D HYPHEN-MINUS character (-). Finally, the comment must be ended by the three character sequence U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN (-->).
So that's why it's complaining. As to how to fix it without changing the code, that's trickier.
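As a quick way to check a comment against the rules quoted above, here is a minimal sketch. It encodes only that quoted passage, nothing else from the spec, and it is not what Sublime itself runs; the function name `comment_is_valid` is made up for the example.

```python
def comment_is_valid(comment):
    """Check a whole comment against the quoted rules (and only those)."""
    if len(comment) < 7:
        return False                      # too short to hold both <!-- and -->
    if not comment.startswith("<!--") or not comment.endswith("-->"):
        return False
    text = comment[4:-3]                  # what sits between <!-- and -->
    return not (
        text.startswith(">")
        or text.startswith("->")
        or "--" in text
        or text.endswith("-")
    )

print(comment_is_valid("<!-- a normal comment -->"))
print(comment_is_valid("<!------- lots of dashes -->"))
```

The second call fails because the text after the opening <!-- still contains consecutive hyphens.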
Your contention that it works is no different really to C developers wondering why they need to worry about undefined behaviour because the code they wrote works fine. The fact that it works fine in one particular implementation is not relevant to portable code.
My advice is to actually change the code. It's not valid, after all, and any browser (current or future) would be well within its rights to simply reject it.
As an aside after some historical digging, it appears this is not allowed because SGML, on which HTML was based, had slightly different rules regarding comments.
On sensing the <!-- token, the parser was switched to a comment mode where > characters were actually allowed within the comment. If the -- sequence was encountered, it changed to a different mode where the > would end the comment.
In fact, it appears to have been a toggle switch between those two modes, so something like <!-- >>>>> -- xyzzy -- >>>>> --> was possible, but putting a > where the xyzzy is would end the comment.
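Taking that description at face value, the toggle can be sketched as a tiny scanner. This is only a toy model of the behaviour as described here, not a real SGML parser, and `sgml_comment_end` is a hypothetical name.

```python
def sgml_comment_end(markup):
    """Return the index of the > that ends the declaration, or -1 if none does."""
    if not markup.startswith("<!"):
        return -1
    in_comment = False
    i = 2
    while i < len(markup):
        if markup.startswith("--", i):
            in_comment = not in_comment   # each -- flips between the two modes
            i += 2
        elif markup[i] == ">" and not in_comment:
            return i                      # a > outside comment mode ends it
        else:
            i += 1
    return -1

long_form = "<!-- >>>>> -- xyzzy -- >>>>> -->"
cut_short = "<!-- >>>>> -- > -- >>>>> -->"
print(sgml_comment_end(long_form) == len(long_form) - 1)
print(sgml_comment_end(cut_short))
```

In the second string the > sits where xyzzy was, so the scan stops there and the rest would be treated as ordinary content.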
XML, for one, didn't adopt this behaviour and HTML has now modified it to follow the "don't use -- within comments at all" rule, the reason being that hardly anyone knew that the comments behaved in the SGML way, causing some pain :-)

Struggling to reproduce terminal from The Unix Programming Environment (1983)

I have been reading The Unix Programming Environment & performing the included exercises. I understand that this work is somewhat dated, but I have found it to be an excellent resource.
In the first chapter, there are a few exercises in which the reader is presented with an interaction with the terminal & is asked to explain the interaction. Here is an example:
Exercise 1-1. Explain what happens with
$ date\@
In the text, it is explained that an @ is to be interpreted as the line kill character. The equivalent on my system is ^u, but I can emulate the terminal in the book with stty kill @.
Based on the reading & my intuition, I would expect the invocation of date\@ to return something to the effect of:
date@: command not found
The text supports this reasoning:
If you precede either # or @ by a backslash \, it loses its special meaning. So to enter a # or @, type \# or \@.
My problem is that I cannot even type the example into my terminal. As soon as I type @, the line is erased. The backslash does not appear to escape the line kill character.
Assuming I am correct about how the escape character should interact with terminal control characters, how can I set up my system (Ubuntu GNU/Linux) to emulate the behavior from the text?
Here is another similar exercise:
Exercise 1-2. Most shells (though not the 7th edition shell) interpret # as introducing a comment, and ignore all text from the # to the end of the line. Given this, explain the following transcript, assuming your erase character is also #:
$ date
Mon Sep 26 12:39:56 EDT 1983
$ #date
Mon Sep 26 12:40:21 EDT 1983
$ \#date
$ \\#date
#date: not found
$
With my erase character set to #, it is impossible to replicate this transcript. The backslash does not appear to escape the erase character.
The Terminal gets and responds to your keystrokes before the Shell does. So the shell has no chance to escape the # since the terminal deletes the whole line first.
When you typed
stty kill @
you told the shell to tell the terminal to kill the line every time you press @
Type
stty kill ^u
and your shell will start to behave the way you expect and ^u will kill lines for you.
^v is the escape char for the terminal
\ is the escape char for the shell.
This is an antique question, about an even more antique book, but I'd like to set the record straight here because the currently accepted answer did not answer your question.
Believe it or not, when I learned UNIX from this book in 1985 (!), this part of the book was already antiquated, and the stuff about "#", "@" and "\" already did not work, and I remember being puzzled exactly like you on why it doesn't work, and whether I was doing something wrong. But it wasn't wrong per se - just out of date. Let me explain how in a previous era (perhaps a decade before the book was published?) this stuff was correct:
Before the advent of CRT terminals, there were "teletype" terminals - basically typewriters which print the characters you type (and the remote responses) on paper. On such teletypes, there was no "backspace". You couldn't erase something already typed. So the convention was that you typed a "#", and it erased, logically, the previous character. You'd still see both of them on the paper, but had to imagine both were deleted from the computer's input. So if you see on paper
helk#lo world
The computer actually received "hello world", with the "#" deleting the "k" behind it.
UNIX also allowed you to type one character, "@", to delete the entire line you just typed, if you made a lot of mistakes. So
oops I wrote a lot of crap I need to erase@hello world
Was again interpreted as just "hello world".
Finally, since sometimes you wanted to type an actual "#" or "@" character and have it be taken literally, not as a character-erase or line-erase command, you also had an "escape character", which in very early days was "\". Note that this escape character was interpreted not by the shell, but rather by the Unix kernel's terminal driver, which communicated with the teletype.
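Here is a small simulation of that old line editing, in case it makes the exercises easier to follow. It is a simplified model, not the real driver: it assumes "#" as erase, "@" as kill and "\" as escape, and it assumes the backslash is dropped when it protects an erase or kill character. The function name `cook_line` is made up.

```python
def cook_line(typed, erase="#", kill="@", escape="\\"):
    """Simplified model of the old terminal driver's erase/kill/escape handling."""
    buf = []
    for ch in typed:
        if ch in (erase, kill) and buf and buf[-1] == escape:
            buf[-1] = ch          # assumed: backslash is consumed, character kept
        elif ch == erase:
            if buf:
                buf.pop()         # logically delete the previous character
        elif ch == kill:
            buf.clear()           # discard everything typed so far on this line
        else:
            buf.append(ch)
    return "".join(buf)

for typed in ["helk#lo world", "oops@hello world", "date\\@", "\\\\#date"]:
    print(repr(typed), "->", repr(cook_line(typed)))
```

Under this model the last input reaches the shell as \#date, and the shell's own backslash handling then turns that into the command name #date, which fits the "#date: not found" line in Exercise 1-2.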
When new CRT terminals appeared, these conventions were quickly phased out and became the ones we know today: The default erase character was no longer "#" but rather the backspace or delete key, and it really erases the character on the screen. The line-erase (somewhat confusingly known as "kill") became control-X or control-U. The escape character became control-V. You can also change these characters with the "stty" command, setting the "erase", "kill", or "lnext" attributes, but people rarely do. "stty -a" shows you all the current settings of these special characters (and many more).

Which Letter takes up the most EM (globally)?

I was reading up on changing placeholder text when I stumbled across this question.
I went back and learnt about placeholders, anyway. And one SO answer said something along the lines of:
Be careful when designing your placeholder text, since anything outside of the control will be cut off.
Putting these two answers together, it made me think (yes, I know, bad thing to do!) -
What is the longest letter in EM in Global (language) terms?
(since we are meant to size letters in EM and all).
The longest in the English Alphabet is 'W' apparently (from linked Question) - so in terms of global languages, what is?
If I had a control like:
+------------------------+
|123456789101112131415161|
+------------------------+
where the placeholder was 24 numbers long. How can I ensure they all fit?
Since numbers seem to be the same EM width:
11111
22222
33333
44444
55555
66666
77777
88888
99999
How can I ensure that 24 characters, no matter what length/EM width will fit?
I could just go:
+------------------------+
|WWWWWWWWWWWWWWWWWWWWWWWW|
+------------------------+
But what if there is a wider letter used from another language? How can I ensure that the placeholder text can be read? (without resizing the input itself dynamically)? I literally want the minimum width it would have to be to display 24 characters, no more - no matter what language is placed in the field.
Here's an example of an even longer 'letter' than English's W:
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
ŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒ
ÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆ
EDIT
I know how I would test (as above), but I don't know the 'contenders' for the widest character in the world.
determine your font (eg. Arial)
determine your font-size (eg. 10px)
determine your font-weight (eg. bold)
determine your character-set (eg. UTF-8)
print a couple of same characters (eg. 24) per row for each character of the character set
divide and conquer
-> remove rows that are obviously shorter than others and refresh the page
-> repeat removal as long as there are more than one rows on the page (for equally long rows just pick any one)
Print each character enclosed by a span and then search for the calculated width for that character (either in your browser's devtools or you can automate this with a simple Javascript script that will check the largest of your characters).

String ordering in Lua

I'm reading Programming in Lua, 1st edition (yup, I know it's a bit outdated), and in the section 3.2 (about relational operators), the author says:
For instance, with the European Latin-1 locale, we have "acai" < "açaí" < "acorde".
I don't get it. For me, it's OK to have "acai" < "açaí", but why is "açaí" < "acorde"?
AFAIK (and wikipedia seems to confirm), "c" < "ç", or am I wrong?
In the third edition of PiL, this statement has been modified:
For instance, with a Portuguese Latin-1 locale, we have "acai"<"açaí"<"acorde".
So the locale needs to be set to Portuguese Latin-1 accordingly:
print("acai" < "açaí")
print("açaí" < "acorde")
print(os.setlocale("pt_PT"))
print("acai" < "açaí")
print("açaí" < "acorde")
On ideone, the result is:
true
false
pt_PT.iso88591
false
true
But the order of "acai" and "açaí" seems to be different from the book now.
You reference a code page, which maps codepoints to characters. Certainly codepoints, being a finite set of non-negative integers, are well-ordered, distinct entities. However, that is not what characters are about.
Characters have a collation order, which is a partial ordering: Characters can be "equal" but not the same. Collation is a user-valued concept that varies by locale (and over time).
Strings are even more complicated because some character sets (e.g. Unicode) can have combining characters. That allows a "character" to be represented as a single character or as a base character followed by the combining characters. For example, "ä" vs "a¨". Since they represent the same conceptual character they should be considered even more equal than "ä" vs "a".
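A short sketch of that last point, using Python's standard unicodedata module. The example uses the combining diaeresis U+0308 for the decomposed form; the variable names are just for illustration.

```python
import unicodedata

precomposed = "\u00e4"     # "ä" as a single code point
decomposed = "a\u0308"     # "a" followed by COMBINING DIAERESIS

# The raw strings differ, even though they render as the same character.
print(precomposed == decomposed)

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Comparison code that wants these two to be "the same" has to normalize first, or use a collation library that does so internally.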
In Spanish, "ch", "rr" and "ll" used to be letters in the alphabet and words were ordered accordingly; now they are not, but "ñ" still is.
Similarly, in the past it was not uncommon for English-speakers to sort surnames beginning with "Mc" and "Mac" after others beginning with "M".
Software libraries have to deal with such things because that's what users want. Thankfully, some of the older conventions have fallen from use.
So, a locale could very well have collation rules that result in "acai" < "açaí" < "acorde" if "c" has the same sort order as "ç" but "i" comes before "í". This case seems strange though the possibility in general requires our code to allow it.
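One way such rules could produce the book's ordering, as a sketch only: compare base letters first and let accents matter only as a tie-breaker. This is not what any particular locale does, and `strip_accents` and `accent_last_key` are made-up helper names.

```python
import unicodedata

def strip_accents(s):
    """Remove combining marks after decomposition, so 'ç' compares like 'c'."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accent_last_key(s):
    # Base letters decide first; the original string only breaks ties.
    return (strip_accents(s), s)

words = ["acorde", "a\u00e7a\u00ed", "acai"]   # "açaí" written with escapes

by_code_point = sorted(words)
by_key = sorted(words, key=accent_last_key)
print(by_code_point)
print(by_key)
```

By code point, "açaí" sorts last because "ç" has a higher code point than "o"; with the key, it falls between "acai" and "acorde" as in the book.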
