Which ASCII Characters are Obsolete?

My understanding is that the ASCII characters in the range from 0x00 to 0x1F were included with Teletype machines in mind. In the modern era, many of them have become obsolete. I was curious as to which characters might still be found in a conventional string or file. From my experience programming in C, I thought those might be NUL, LF, TAB, and maybe EOT. I'm especially curious about BS and ESC, as I thought (similar to Shift or Control, maybe) that those might be handled by the OS and never really printed or included in a string. Any amount of insight would be appreciated!

Out of the characters between hexadecimal 00 and 1F, the only ones you are likely to encounter frequently are NUL (0x00 = \0), TAB (0x09 = \t), CR (0x0D = \r), and LF (0x0A = \n). Of these, NUL is used in C-like languages as a string terminator, TAB is used as a tab character, and CR and LF are used at the end of a line. (Which one is used is a complicated situation; see the Wikipedia article Newline for details, including a history of how this came to be.)
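As a quick illustration, here is a minimal C sketch showing how those four characters show up in an ordinary string (the hex dump assumes an ASCII-compatible execution character set):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "\t" is TAB (0x09) and "\r\n" is CR followed by LF (0x0D 0x0A);
       the compiler appends the NUL terminator (0x00) automatically. */
    const char line[] = "name\tvalue\r\n";

    /* strlen() stops at the NUL terminator, so it reports 12 bytes;
       sizeof reports 13 because it includes the terminating NUL. */
    printf("strlen = %zu, sizeof = %zu\n", strlen(line), sizeof line);

    for (size_t i = 0; i < sizeof line; i++)
        printf("%02X ", (unsigned char)line[i]);
    putchar('\n');
    return 0;
}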
The following additional characters are used when communicating with VT100-compatible terminal emulators, but are rarely found outside that context:
BEL (0x07 = \a), which causes a terminal to beep and/or flash.
BS (0x08 = \b), which is used to move the cursor left one position. (It is not sent when you press the backspace key; see below!)
SO and SI (0x0E and 0x0F), which are used to switch into certain special character sets.
ESC (0x1B = \e), which is sent when pressing the Escape key and various other function keys, and is additionally used to introduce escape sequences which control the terminal.
DEL (0x7F), which is sent when you press the backspace key.
The rest of the nonprintable ASCII characters are essentially unused.
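As a small demonstration of the terminal-oriented characters (a sketch assuming a VT100-compatible terminal emulator; the exact effect of BEL and of the reverse-video sequence depends on the emulator):

#include <stdio.h>

int main(void)
{
    /* BEL (0x07): most emulators beep and/or flash the window. */
    printf("ding\a\n");

    /* BS (0x08) only moves the cursor left; it does not erase anything.
       Overprinting with spaces is the classic way to wipe characters. */
    printf("12345\b\b  \n");                 /* displays "123  " */

    /* ESC (0x1B) introduces control sequences: reverse video on, then
       the reset sequence, on VT100-style terminals. */
    printf("\x1b[7mreverse video\x1b[0m normal\n");
    return 0;
}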

"Backspace composition no longer works with typical modern digital displays or typesetting systems" Ref Backspace
Here's a related question: The backspace escape character in c unexpected behavior
Ref Unicode
Unicode and the ISO/IEC 10646 Universal Character Set (UCS) have a much wider array of characters and their various encoding forms have begun to supplant ISO/IEC 8859 and ASCII rapidly in many environments. While ASCII is limited to 128 characters, Unicode and the UCS support more characters by separating the concepts of unique identification (using natural numbers called code points) and encoding (to 8-, 16- or 32-bit binary formats, called UTF-8, UTF-16 and UTF-32).
To allow backward compatibility, the 128 ASCII and 256 ISO-8859-1 (Latin 1) characters are assigned Unicode/UCS code points that are the same as their codes in the earlier standards. Therefore, ASCII can be considered a 7-bit encoding scheme for a very small subset of Unicode/UCS, and ASCII (when prefixed with 0 as the eighth bit) is valid UTF-8.
Here's another question on Unicode and backspace: What is the purpose of the Unicode backspace (U+0008)?
Here's a good overview of programming for Unicode in C: How to program for Unicode and UTF-8.
And finally, here's the GNU implementation (from FSF.org): the GNU libunistring manual.
"This library provides functions for manipulating Unicode strings and for manipulating C strings according to the Unicode standard."

Related

Why do ANSI color escapes end in 'm' rather than ']'?

ANSI terminal color escapes can be done with \033[...m in most programming languages. (You may need to use \e or \x1b in some languages.)
What has always seemed odd to me is how they start with \033[ but end in m. Is there some historical reason for this (perhaps ] was mapped to the slot that is now occupied by m in the ASCII table?) or is it an arbitrary character choice?
It's not completely arbitrary, but follows a scheme laid out by committees, and documented in ECMA-48 (the same as ISO 6429). Except for the initial Escape character, the succeeding characters are specified by ranges.
While the pair Escape[ is widely used (this is called the control sequence introducer CSI), there are other control sequences (such as Escape], the operating system command OSC). These sequences may have parameters, and a final byte.
In the question, using CSI, the m is a final byte, which happens to tell the terminal what the sequence is supposed to do. The parameters if given are a list of numbers. On the other hand, with OSC, the command-type is at the beginning, and the parameters are less constrained (they might be any string of printable characters).
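For illustration, here is a small C sketch contrasting the two kinds of sequences (whether the OSC title change is honoured depends on the terminal emulator):

#include <stdio.h>

int main(void)
{
    /* CSI sequence: ESC [ <parameters> <final byte>.  The final byte 'm'
       selects "Select Graphic Rendition"; the parameters 1;31 request
       bold red text. */
    printf("\x1b[1;31mbold red\x1b[0m plain\n");

    /* OSC sequence: ESC ] <command> ; <text> <terminator>.  Command 0
       asks xterm-like terminals to set the window title; BEL terminates
       the string here (ESC \ is the other common terminator). */
    printf("\x1b]0;demo title\a");
    return 0;
}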

What is extended 7-bit (or 8-bit) code?

I just started reading the ECMA-48 standard (ISO/IEC 6429), and have a question.
It says:
This Ecma Standard defines control functions and their coded representations for use in a 7-bit code, an extended 7-bit code, an 8-bit code or an extended 8-bit code.
What does the "extended" 7/8-bit code mean here?
ECMA-35 talks about these. These terms are key:
code extension: The techniques for the encoding of characters that are not included in the character set of a given code.
escape sequence: A string of bit combinations that is used for control purposes in code extension procedures. The first of these bit combinations represents the control function ESCAPE.
Character ESCAPE: ESCAPE is a control character used for code extension purposes. It causes the meaning of a limited number of the bit combinations following it in a CC-data-element to be changed. These bit combinations, together with the preceding bit combination that represents the ESC character, constitute an escape sequence.
Thus, what we have here is a system where you can switch encoding systems in the middle of your text: You can start a text using Latin-1 encoding, provide an escape sequence that switches to Latin-2, and continue your text. ECMA-35 talks about this in appendix A. Chapter 13 has more information about the structure of escape sequences.
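Purely to illustrate the mechanism, here is a sketch that emits ECMA-35 designation sequences for switching the G1 set between the Latin-1 and Latin-2 right halves. The specific final bytes ('A' and 'B') are my recollection of the ISO-IR registrations, so treat them as illustrative; most modern UTF-8 terminals will simply ignore these sequences:

#include <stdio.h>

int main(void)
{
    /* An ECMA-35 escape sequence is ESC, zero or more intermediate bytes
       (0x20-0x2F), and a final byte (0x30-0x7E).  The intermediate '-'
       (0x2D) means "designate a 96-character set into G1"; the final
       byte names the set. */
    fputs("\x1b-A", stdout);      /* designate Latin-1 right half into G1 */
    /* ... text whose high half is now interpreted as Latin-1 ... */
    fputs("\x1b-B", stdout);      /* designate Latin-2 right half into G1 */
    /* ... text whose high half is now interpreted as Latin-2 ... */
    return 0;
}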

Is there a unicode range that is a copy of the first 128 characters?

I would like to be able to put quotation marks and other characters into a text without their being interpreted by the computer. So I was wondering: is there a range that is defined as mapping to the same glyphs, etc., as the range 0-0x7F (the ASCII range)?
Please note that I state that the range 0-0x7F is the same as ASCII, so the question is not what range maps to ASCII.
I am asking whether there is another range that also maps to the same glyphs, i.e. will look the same when rendered but can be seen as different codes when interpreted.
So I can write
print "hello "world""
with the characters in bold (the inner quotation marks) avoiding the 0-0x7F (ASCII) range.
Additional:
I meant homographic and behaviourally identical, i.e. everything the same except for a different code point. I was hoping for the whole 128-character ASCII set, directly mapped (an offset added to them all).
The reason: to avoid interpretation by any language that uses some of the ASCII characters as part of its syntax but allows any Unicode character in literal strings, e.g. (when UTF-8 encoded) C, HTML, CSS, …
I was trying to retrofit the idea of "no reserved words" / "word colours" (string literals one colour, keywords another, variables another, numbers another, etc.) so that a string literal or variable name (though not in this case) can contain any character.
I interpret the question to mean "is there a set of code points which are homographic with the low 7-bit ASCII set". The answer is no.
There are some code points which are conventionally rendered homographically (e.g. Cyrillic uppercase А U+0410 looks identical to ASCII 65 in many fonts, and quite similar in most fonts which support this code point), but they are different code points with different semantics. Similarly, there are some code points which basically render identically but have a specific set of semantics, like the non-breaking space U+00A0, which renders identically to ASCII 32 but is specified as having a particular line-breaking property; or the RIGHT SINGLE QUOTATION MARK U+2019, which is an unambiguous quotation mark, as opposed to its twin ASCII 39, the "apostrophe".
But in summary, there are many symbols in the basic ASCII block which do not coincide with a homograph in another code block. You might be able to find homographs or near-homographs for your sample sentence, though; I would investigate the IPA phonetic symbols and the Greek and Cyrillic blocks.
The answer to the question asked is “No”, as @tripleee described, but the following note might be relevant if the purpose is trickery or fun of some kind:
The printable ASCII characters excluding the space have been duplicated at U+FF01 to U+FF5E, but these are fullwidth characters intended for use in CJK texts. Their shape is (and is meant to be) different: ｈｅｌｌｏ ｗｏｒｌｄ. (Your browser may be unable to render them.) So they are not really homographic with ASCII characters but could be used for some special purposes. (I have no idea of what the purpose might be here.)
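If the fullwidth trick were acceptable for the original goal, the mapping really is just a fixed offset, as this minimal sketch shows (it assumes output goes to a UTF-8 terminal, and the resulting glyphs are deliberately wider than their ASCII counterparts):

#include <stdio.h>

/* Print an ASCII string as its fullwidth (U+FF01-U+FF5E) look-alikes,
   encoded as UTF-8.  The mapping is a fixed offset of 0xFEE0 applied to
   the printable characters other than space. */
static void print_fullwidth(const char *s)
{
    for (; *s; s++) {
        unsigned cp = (unsigned char)*s;
        if (cp > 0x20 && cp < 0x7F)
            cp += 0xFEE0;                /* '!'..'~' -> U+FF01..U+FF5E */
        if (cp < 0x80) {                 /* space and controls unchanged */
            putchar((int)cp);
        } else {                         /* three-byte UTF-8 sequence */
            putchar(0xE0 | (cp >> 12));
            putchar(0x80 | ((cp >> 6) & 0x3F));
            putchar(0x80 | (cp & 0x3F));
        }
    }
}

int main(void)
{
    print_fullwidth("hello \"world\"\n");
    return 0;
}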
Depends on the Unicode encoding you use.
In UTF-8, the first 128 characters are encoded with exactly the same code values as their ASCII counterparts, one byte each. In UTF-16, the first 128 ASCII characters occupy code units 0x0000 to 0x007F (two bytes each).

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using (I use both .Net and C/C++, and I need this answer for both Unix and Windows).
Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
EDIT 20140523: Also, watch Characters, Symbols and the Unicode Miracle by Tom Scott on YouTube - it's just under ten minutes, and a wonderful explanation of the brilliant 'hack' that is UTF-8
A character encoding is a mapping in which each code looks up a symbol from a given character set. Please see this good Wikipedia article on character encoding.
UTF-8 uses 1 to 4 bytes for each symbol. Wikipedia gives a good rundown of how the multi-byte encoding works:
The most significant bit of a single-byte character is always 0.
The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, and so on.
The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
A UTF-8 stream contains neither the byte FE nor FF. This makes sure that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (byte-order mark).
The page also shows you a great comparison between the advantages and disadvantages of each character encoding type.
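As a sketch of how those bit rules translate into code, here is a hand-rolled encoder (illustration only; it rejects surrogates and out-of-range values but does no other validation):

#include <stdio.h>
#include <stdint.h>

/* Encode one code point into buf following the bit rules quoted above;
   returns the number of bytes written, or 0 for an invalid code point. */
static int utf8_encode(uint32_t cp, unsigned char buf[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {              /* 110xxxxx 10xxxxxx */
        buf[0] = 0xC0 | (cp >> 6);
        buf[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {            /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        buf[0] = 0xE0 | (cp >> 12);
        buf[1] = 0x80 | ((cp >> 6) & 0x3F);
        buf[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else if (cp <= 0x10FFFF) {          /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        buf[0] = 0xF0 | (cp >> 18);
        buf[1] = 0x80 | ((cp >> 12) & 0x3F);
        buf[2] = 0x80 | ((cp >> 6) & 0x3F);
        buf[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char buf[4];
    uint32_t samples[] = { 0x41, 0xE5, 0x20AC, 0x1F600 };  /* A, å, €, an emoji */
    for (int i = 0; i < 4; i++) {
        int n = utf8_encode(samples[i], buf);
        printf("U+%04X ->", (unsigned)samples[i]);
        for (int j = 0; j < n; j++)
            printf(" %02X", buf[j]);
        putchar('\n');
    }
    return 0;
}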
UTF-16 (UCS-2)
Uses 2 or 4 bytes for each symbol. (UCS-2 proper always uses exactly 2 bytes and cannot represent code points above 0xFFFF.)
UTF-32 (UCS-4)
Always uses 4 bytes for each symbol.
char just means a byte of data and is not an actual encoding. It is not analogous to UTF-8/UTF-16/ASCII. A char* pointer can refer to any type of data and any encoding.
STL:
Neither std::string nor std::wstring is designed for variable-length character encodings like UTF-8 and UTF-16.
How to implement:
Take a look at the iconv library. iconv is a powerful character-encoding conversion library used by such projects as libxml (the XML C parser of GNOME).
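Here is a minimal sketch of using iconv to convert a UTF-8 string to UTF-16LE (it assumes a POSIX system where iconv lives in libc; elsewhere you may need to link with -liconv, and real code would handle E2BIG/EILSEQ/EINVAL properly):

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    char in[] = "h\xC3\xA9llo";          /* "héllo" spelled out as UTF-8 bytes */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");

    size_t produced = sizeof out - outleft;
    for (size_t i = 0; i < produced; i++)
        printf("%02X ", (unsigned char)out[i]);
    putchar('\n');

    iconv_close(cd);
    return 0;
}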
Other great resources on character encoding:
tbray.org's Characters vs. Bytes
IANA character sets
www.cs.tut.fi's A tutorial on code issues
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (first mentioned by @Dylan Beattie)
Received wisdom suggests that Spolsky's article misses a couple of important points.
This article is recommended as being more complete:
The Unicode® Standard: A Technical Introduction
This article is also a good introduction: Unicode Basics
The latter in particular gives an overview of the character encoding forms and schemes for Unicode.
The various UTF standards are ways to encode "code points". A code point is an index into the Unicode character set.
Another encoding is UCS-2, which is always 16-bit and thus doesn't support the full Unicode range.
It is also good to know that one code point isn't necessarily equal to one character. For example, a character such as å can be represented either as a single code point or as two code points, one for the a and one for the combining ring.
Comparing two Unicode strings thus requires normalization to get the canonical representation before comparison.
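As a small sketch of why that matters, the two byte sequences below usually render the same glyph but are not byte-equal (how the last line displays depends on your terminal and font):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Two encodings of "å": the precomposed code point U+00E5, and the
       decomposed pair U+0061 (a) + U+030A (combining ring above).  They
       normally render identically but compare as different bytes, which
       is why strings should be normalized before comparison. */
    const char composed[]   = "\xC3\xA5";      /* U+00E5 as UTF-8 */
    const char decomposed[] = "a\xCC\x8A";     /* U+0061 U+030A as UTF-8 */

    printf("bytes equal: %s\n",
           strcmp(composed, decomposed) == 0 ? "yes" : "no");
    printf("%s vs %s\n", composed, decomposed);
    return 0;
}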
There is also the issue with fonts. There are two ways to handle fonts. Either you use a gigantic font with glyphs for all the Unicode characters you need (I think recent versions of Windows come with one or two such fonts), or you use some library capable of combining glyphs from various fonts dedicated to subsets of the Unicode standard.

Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment?

I know it is customary, but why? Are there real technical reasons why any other way would be a really bad idea or is it just based on the history of encoding and backwards compatibility? In addition, what are the dangers of not using UTF-8, but some other encoding (most notably, UTF-16)?
Edit : By interacting, I mostly mean the shell and libc.
Partly because the file systems expect NUL ('\0') bytes to terminate file names, so UTF-16 would not work well. You'd have to modify a lot of code to make that change.
As Jonathan Leffler mentions, the prime issue is the ASCII NUL character. C traditionally expects a string to be null-terminated, so standard C string functions will choke on any UTF-16 character containing a byte equivalent to an ASCII NUL (0x00). While you can certainly program with wide-character support, UTF-16 is not a suitable external encoding of Unicode for filenames, text files, or environment variables.
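A minimal illustration of the embedded-NUL problem (the UTF-16LE bytes are written out by hand here):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "AB" encoded as UTF-16LE: each ASCII character carries a 0x00 high
       byte, so C string functions stop after the first byte. */
    const char utf16le[] = { 'A', 0x00, 'B', 0x00, 0x00, 0x00 };
    printf("UTF-16LE: strlen reports %zu byte(s)\n", strlen(utf16le));  /* 1 */

    /* The same text in UTF-8 is plain ASCII with no embedded NULs. */
    const char utf8[] = "AB";
    printf("UTF-8:    strlen reports %zu byte(s)\n", strlen(utf8));     /* 2 */
    return 0;
}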
Furthermore, UTF-16 and UTF-32 have both big-endian and little-endian orientations. To deal with this, you'll either need external metadata like a MIME type, or a Byte Order Mark. It notes,
Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts.
The predecessor to UTF-16, which was called UCS-2 and didn't support surrogate pairs, had the same issues. UCS-2 should be avoided.
I believe it's mainly the backwards compatibility that UTF-8 gives with ASCII.
For an answer to the 'dangers' question, you need to specify what you mean by 'interacting'. Do you mean interacting with the shell, with libc, or with the kernel proper?
Modern Unixes use UTF-8, but this was not always true. On RHEL2 -- which is only a few years old -- the default is
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
The C/POSIX locale is expected to be a 7-bit ASCII-compatible encoding.
However, as Jonathan Leffler stated, any encoding which allows for NUL bytes within a character sequence is unworkable on Unix, as system APIs are locale-ignorant; strings are all assumed to be byte sequences terminated by \0.
I believe that when Microsoft started using a two-byte encoding, characters above 0xFFFF had not been assigned, so using a two-byte encoding meant that no one had to worry about characters being different lengths.
Now that there are characters outside this range, you'll have to deal with characters of different lengths anyway, so why would anyone use UTF-16? I suspect Microsoft would make a different decision if they were designing their Unicode support today.
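For reference, a sketch of the surrogate-pair arithmetic that UTF-16 now needs for code points above 0xFFFF (the sample code point is arbitrary):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Code points above U+FFFF are split into a surrogate pair: subtract
       0x10000, then place the high 10 bits after 0xD800 and the low 10
       bits after 0xDC00. */
    uint32_t cp = 0x1F600;
    uint32_t v  = cp - 0x10000;
    unsigned hi = 0xD800 | (v >> 10);      /* high (lead) surrogate */
    unsigned lo = 0xDC00 | (v & 0x3FF);    /* low (trail) surrogate */

    printf("U+%04X -> %04X %04X (two 16-bit units)\n", (unsigned)cp, hi, lo);
    printf("U+0041 -> 0041 (one 16-bit unit)\n");
    return 0;
}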
Yes, it's for compatibility reasons. UTF-8 is backwards compatible with ASCII. Linux/Unix were ASCII-based, so it just made/makes sense.
I thought 7-bit ASCII was fine.
Seriously, Unicode is relatively new in the scheme of things, and UTF-8 is backward compatible with ASCII and uses less space (half) for typical files since it uses 1 to 4 bytes per code point (character), while UTF-16 uses either 2 or 4 bytes per code point (character).
UTF-16 is preferable for internal program usage because of the simpler widths. Its predecessor UCS-2 was exactly 2 bytes for every code point.
I think it's because programs that expect ASCII input won't be able to handle encodings such as UTF-16. For most characters (in the 0-255 range), those programs would see the high byte as a NUL (0) character, which is used in many languages and systems to mark the end of a string. This doesn't happen in UTF-8, which was designed to avoid embedded NULs and to be byte-order agnostic.
