Bare-minimum text sanitation

Bare-minimum text sanitation - string

In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let's say that it's a web application), which characters should always be removed from incoming text?
I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:
The range 0x00-0x19 (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
The range 0x7F-0x9F (more control characters)
Ranges of characters that can safely be accepted would be even better to know.
There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I'm mainly interested in the basics.

See the W3 Unicode in XML and other markup languages note. It defines a class of characters as ‘discouraged for use in markup’, which I'd definitely filter out for most web sites. It notably includes such characters as:
U+2028–9 which are funky newlines that will confuse JavaScript if you try to use them in a string literal;
U+202A–E which are bidi control codes that wily users can insert to make text appear to run backwards in some browsers, even outside of a given HTML element;
language override control codes that could also have scope outside of an element;
BOM.
Additionally, you'd want to filter/replace the characters that are not valid in Unicode at all (U+FFFF et al), and, if you are using a language that works in UTF-16 natively (eg. Java, Python on Windows), any surrogate characters (U+D800–U+DFFF) that do not form valid surrogate pairs.
The range 0x00-0x19 (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
And arguably (esp for a web application), lose CR as well, and turn tabs into spaces.
The range 0x7F-0x9F (more control characters)
Yep, away with those, except in case where people might really mean them. (SO used to allow them, which allowed people to post strings that had been mis-decoded, which was occasionally useful for diagnosing Unicode problems.) For most sites I think you'd not want them.

I suppose it depends on your purpose. In UTF-8, you could limit the user to the keyboard characters if that is your whim, which is 9,10,13,[32-126]. If you are using UTF-8, the 0x7f+ range signifies that you have a multi-byte Unicode character. In ASCII, 0x7f+ consists special display/format characters, and is localized to allow extensions depending on the language at the location.
Note that in UTF-8, the keyboard characters can differ depending on location, since users can input characters in their native language which will be outside the 0x00-0x7f range if their language doesn't use a Latin script without accents (Arabic, Chinese, Japanese, Greek, Crylic, etc.).
If you take a look here you can see what characters from UTF-8 will display.

Related

How to get unicode of characters from 55296 to 56319 in Excel

I generated a list of letters in excel, from character codes 1 to 66535.
I am trying to get back the unicode by using the function "UNICODE". However, excel return #VALUE! for character codes from 55296 to 56319.
Please advise if there are any other function that can return a proper unicodes.
Thank you.

The range you are listing is a special range in Unicode: surrogates.
So, they have Unicode code point, but the problem it is you cannot have them in a text: Windows uses UCS-2/UTF-16 as internal encoding, so there are no way you can put in text. Or better: you to have code points above 65535, Windows uses two surrogates, one in the range 0xD800-0xDBFF (high surrogate) and the second one 0xDC00-0xDFFF )low surrogate). By combining these two, you have all Unicode code points.
But so, you should never have a single surrogate (or a mismatch surrogate, e.g. a high surrogate not followed a low surrogate, or a low surrogate not preceded be a high surrogate).
So, just skip such codes. Or better use them correctly to have characters above 65535.
Note: you cannot have all Unicode characters only with one code point. many characters requires combining many code points (there is a whole category of "combining characters" in Unicode). E.g. the zero with a oblique line is rendered with two unicode characters: the normal zero, and a variant selector. Also accented characters are very limited (and often with just one accent per characters). And without going to more complex scripts.

Vim: Utf-8 ې character breaks displayed string

I have file that has hex content: db90 3031 46, which should be displayed in vim as "ې" followed by "01F", but what I noticed is that it is never displayed correctly. Then I noticed It is the same in other places like in terminal and browser I always get ې01F? Why is that? Just paste that in google and try yourself you will never be able to put "ې" and 0 as next character.

That's an Arabic character with right-to-left indicator, so you probably need to switch back to left-to-right mode, such as with U+200e.
The Unicode bidirectional stuff is rather complex - the behaviour you are seeing is probably caused by the fact that the Latin digits are marked EN = European number (a weak type), while letters such as F are marked L = left to right (a strong type).
Weak types are treated differently in the Unicode specification, such as with this quote which covers your particular case (my emphasis):
Problematic cases may occur when a right-to-left paragraph begins with left-to-right characters, or there are nested segments of different-direction text, or there are weak characters on directional boundaries. In these cases, embeddings or directional marks may be required to get the right display.
So your code point followed by a digit renders as "ې7" (I typed that 7 in after the Arabic character despite the fact it's showing up before it), while following it with a letter gives "ېX".
For what it's worth, the text "ې‎7" was generated here by inserting ‎ between the two characters, the HTML equivalent of the U+200e Unicode code point.
If you head on over to this UTF-8 codec site and enter %u06D0%u200e7 into the decoding section, you'll see that it comes out in your desired order (removing the %200e shows it in the order you're describing in your question).

Usable Unicode Ranges for Custom Text Process

I am working on a processor that parts texts into blocks with marks:
LOREM IPSUM SED AMED
will be parsed like:
{word:1}LOREM{/word:1}{space:2}
{word:3}IPSUM{/word:3}{space:4}
{word:5}SED{/word:5}{space:6}
{word:7}AMED{/word:7}
But I dont want to use "{word}" etc, because it causes processor down, because it is an string again... I need to mark like these:
\E002\0001 LOREM \E003\0001 \E004\0002
\E002\0003 IPSUM \E003\0004 \E004\0005
\E002\0006 SED \E003\0006 \E004\0007
\E002\0008 AMED \E003\0008
First \E002 means element type number, its last bit represent element's close. So element number increments with +2.
Second \0001 means element index for stacking.
I am just used \E002 irrelevantly for this example.
But \0001 also using in Unicode Range, and this leads me to where I start again...
So which unicode range can I use? \ff0000? or how can I solve this?
Thanks!

The Unicode Consortium thought of this. There is a range of Unicode code points that are meant to never represent a displayable character, but meta-codes instead:
Noncharacters are code points that are permanently reserved and will never have characters
assigned to them.
...
Tag characters were intended to support a general scheme for the internal tagging of text
streams in the absence of other mechanisms, such as markup languages. The use of tag
characters for language tagging is deprecated.
(http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf)
You should be able to use regular control characters as "private" tags, because these should never occur in proper strings. This would be the range from U+0000 to U+001F, excluding tab (U+0009), the common "returns" (U+000A and U+000D), and, for safety, U+0000 itself (some libraries do not like Null characters in the middle of strings).
Non-characters
Noncharacters are code points that are permanently reserved in the Unicode Standard for
internal use. They are not recommended for use in open interchange of Unicode text data.
You can use U+FEFF (which is currently officially defined as Not-A-Character), or U+FFFE and U+FFFF. There are several more "officially not-a-characters" defined, and you can be fairly sure they would not occur in regular text strings.
A few random sequences with predefined definitions, and so highly unlikely to occur in plain text strings are:
Specials: U+FFF0–U+FFF8
The nine unassigned Unicode code points in the range U+FFF0..U+FFF8 are reserved for
special character definitions.
Annotation Characters: U+FFF9–U+FFFB
An interlinear annotation consists of annotating text that is related to a sequence of annotated
characters. For all regular editing and text-processing algorithms, the annotated characters
are treated as part of the text stream. The annotating text is also part of the content,
but for all or some text processing, it does not form part of the main text stream.
Tag Characters: U+E0000–U+E007F
This block encodes a set of 95 special-use tag characters to enable the spelling out of ASCIIbased
string tags using characters that can be strictly separated from ordinary text content
characters in Unicode.
(all quotations from the chapter as above)
Staying within conventions, you can also use U+2028 (line separator) and/or U+2029 paragraph separator.
Technically, your use of U+E000–U+F8FF (the "Private Use Area") is okay-ish, because these code points only can define an unambiguous character in combination with a certain font. However, it is possible these codes may pop up if you get your plain text from a source where the font was included.
As for how to encode this into your strings: it doesn't really matter if the numerical code immediately following your private tag marker is a valid Unicode character or not. If you see one of your own tag markers, then the value immediately following is always your own private sequence number.
As you see, there are lots of possibilities. I guess the most important criterium is whether you want to use other functions on these strings. If you create a string that is technically invalid Unicode (for instance, because it includes not-a-character values), some external functions may choose to fail to work on them, or silently remove the bad values. In such a case, you'd need to rigorously stick to a system in which you only use 'valid' code points.

How can I find the character code of a special character in my text editor?

When pasting text from outside sources into a plain-text editor (e.g. TextMate or Sublime Text 2) a common problem is that special characters are often pasted in as well. Some of these characters render fine, but depending on the source, some might not display correctly (usually showing up as a question mark with a box around it).
So this is actually 2 questions:
Given a special character (e.g., ’ or ♥) can I determine the UTF-8 character code used to display that character from inside my text editor, and/or convert those characters to their character codes?
For those "extra-special" characters that come in as garbage, is there any way to figure out what encoding was used to display that character in the source text, and can those characters somehow be converted to UTF-8?

My favorite site for looking up characters is fileformat.info. They have a great Unicode character search that includes a lot of useful information about each character and its various encodings.
If you see the question mark with a box, that means you pasted something that can't be interpreted, often because it's not legal UTF-8 (not every byte sequence is legal UTF-8). One possibility is that it's UTF-16 with an endian mode that your editor isn't expecting. If you can get the full original source into a file, the file command is often the best tool for determining the encoding.

At &what I built a tool to focus on searching for characters. It indexes all the Unicode and HTML entity tables, but also supplements with hacker dictionaries and a database of keywords I've collected, so you can search for words like heart, quot, weather, umlaut, hash, cloverleaf and get what you want. By focusing on search, it avoids having to hunt around the Unicode pages, which can be frustrating. Give it a try.

Why does question mark show up in web browser?

I was (re)reading Joel's great article on Unicode and came across this paragraph, which I didn't quite understand:
For example, you could encode the Unicode string for Hello (U+0048
U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding,
or the Hebrew ANSI Encoding, or any of several hundred encodings that
have been invented so far, with one catch: some of the letters might
not show up! If there's no equivalent for the Unicode code point
you're trying to represent in the encoding you're trying to represent
it in, you usually get a little question mark: ? or, if you're really
good, a box. Which did you get? -> �
Why is there a question mark, and what does he mean by "or, if you're really good, a box"? And what character is he trying to display?

There is a question mark because the encoding process recognizes that the encoding can't support the character, and substitutes a question mark instead. By "if you're really good," he means, "if you have a newer browser and proper font support," you'll get a fancier substitution character, a box.
In Joel's case, he isn't trying to display a real character, he literally included the Unicode replacement character, U+FFFD REPLACEMENT CHARACTER.

It’s a rather confusing paragraph, and I don’t really know what the author is trying to say. Anyway, different browsers (and other programs) have different ways of handling problems with characters. A question mark “?” may appear in place of a character for which there is no glyph in the font(s) being used, so that it effectively says “I cannot display the character.” Browsers may alternatively use a small rectangle, or some other indicator, for the same purpose.
But the “�” symbol is REPLACEMENT CHARACTER that is normally used to indicate data error, e.g. when character data has been converted from some encoding to Unicode and it has contained some character that cannot be represented in Unicode. Browsers often use “�” in display for a related purpose: to indicate that character data is malformed, containing bytes that do not constitute a character, in the character encoding being applied. This often happens when data in some encoding is being handled as if it were in some other encoding.
So “�” does not really mean “unknown character”, still less “undisplayable character”. Rather, it means “not a character”.

A question mark appears when a byte sequence in the raw data does not match the data's character set so it cannot be decoded properly. That happens if the data is malformed, if the data's charset is explicitally stated incorrectly in the HTTP headers or the HTML itself, the charset is guessed incorrectly by the browser when other information is missing, or the user's browser settings override the data's charset with an incompatible charset.
A box appears when a decoded character does not exist in the font that is being used to display the data.

Just what it says - some browsers show "a weird character" or a question mark for characters outside of the current known character set. It's their "hey, I don't know what this is" character. Get an old version of Netscape, paste some text form Microsoft Word which is using smart quotes, and you'll get question marks.
http://blog.salientdigital.com/2009/06/06/special-characters-showing-up-as-a-question-mark-inside-of-a-black-diamond/ has a decent explanation.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string