Unicode underline does not completely underline some characters - string

I have a string in a C# application that needs to be underlined. This has to be done in Unicode because the string is exported and displayed in a Word file. To do this I preceded every character with the combining underline \u0332, which works, but it does not completely underline the 'm' character, as seen in this screenshot:
I have tried inserting \u0332 several times, both before and after the m, but the output is always the same.
Is there any way to get it to completely underline the character?
EDIT: I just tried the continuous underline symbol \u2381, but that does not render at all.

U+0332 is a Unicode combining character, so it goes after the character that it modifies, not before it. But this only specifies that the character should be underlined. The specific graphical representation depends on the application and its rendering engine; it's not fully supported everywhere. Try pasting the text i̲m̲p̲o̲r̲t̲a̲n̲t̲ into the application and see if it works as intended. If not, then there is nothing you can do except use another representation such as *important* or IMPORTANT, or export in a supported rich text format (RTF, docx, etc.).
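The placement rule can be sketched in a few lines. This is a minimal illustration in Python rather than C# (the same ordering applies to a C# string); the helper name `underline` is made up for the example, and how the result looks still depends on the renderer.

```python
COMBINING_LOW_LINE = "\u0332"  # U+0332, attaches to the character before it


def underline(text: str) -> str:
    """Return text with U+0332 appended AFTER every character."""
    return "".join(ch + COMBINING_LOW_LINE for ch in text)


underlined = underline("important")
print(underlined)       # rendering depends on the font and application
print(len(underlined))  # twice the original length: one mark per character
```

Putting the mark before a character instead would attach it to the previous character, or to nothing at the start of the string.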

Related

Vim: Utf-8 ې character breaks displayed string

I have a file with the hex content db90 3031 46, which should be displayed in vim as "ې" followed by "01F", but it is never displayed correctly. Then I noticed it is the same in other places, such as the terminal and the browser: I always get ې01F. Why is that? Just paste that into Google and try it yourself; you will never be able to put "ې" with 0 as the next character.
That's an Arabic character with right-to-left indicator, so you probably need to switch back to left-to-right mode, such as with U+200e.
The Unicode bidirectional stuff is rather complex - the behaviour you are seeing is probably caused by the fact that the Latin digits are marked EN = European number (a weak type), while letters such as F are marked L = left to right (a strong type).
Weak types are treated differently in the Unicode specification, such as with this quote which covers your particular case (my emphasis):
Problematic cases may occur when a right-to-left paragraph begins with left-to-right characters, or there are nested segments of different-direction text, or there are weak characters on directional boundaries. In these cases, embeddings or directional marks may be required to get the right display.
So your code point followed by a digit renders as "ې7" (I typed that 7 in after the Arabic character despite the fact it's showing up before it), while following it with a letter gives "ېX".
For what it's worth, the text "ې‎7" was generated here by inserting &#x200e; between the two characters, the HTML equivalent of the U+200e Unicode code point.
If you head on over to this UTF-8 codec site and enter %u06D0%u200e7 into the decoding section, you'll see that it comes out in your desired order (removing the %u200e shows it in the order you're describing in your question).
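To see the bidirectional classes involved, here is a small sketch using Python's standard `unicodedata` module. It decodes the bytes from the question and then inserts U+200E as suggested above; the helper name `bidi_classes` is made up for the example. The actual display order is still decided by whatever renders the string.

```python
import unicodedata

# The bytes from the question: db 90 is the UTF-8 encoding of U+06D0.
raw = bytes.fromhex("db90303146")
text = raw.decode("utf-8")


def bidi_classes(s: str):
    """Pair each code point with its Unicode bidirectional class."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in s]


# AL = Arabic letter (strong), EN = European number (weak), L = left-to-right (strong)
print(bidi_classes(text))

# Insert LEFT-TO-RIGHT MARK after the Arabic letter so the digits
# follow a strong left-to-right character.
LRM = "\u200e"
fixed = text[0] + LRM + text[1:]
print(bidi_classes(fixed))
```

The stored order of the characters never changes; only the visual ordering computed by the renderer does.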

How to underline text in Python 3.6.5

How can I print underlined text in Python, similar to what is shown on Wikipedia? What Unicode characters would I give to Python to make this work?
In Python, Unicode characters can be expressed with \uXXXX, where XXXX is a four-digit hex number identifying the code point (code points above U+FFFF need the eight-digit form \UXXXXXXXX).
Wikipedia shows the use of "combining low line" (U+0332).
Since it's a combining character, you need to place it after each character you want to be underlined.
So this code should print aaau̲zzz (u should be underlined in most browsers).
print('aaau\u0332zzz')
Note that this doesn't seem to work very well.
My gnome-terminal (which identifies as GNOME Terminal 3.26.2 Using VTE version 0.50.3 +GNUTLS), using Monospace Regular font, mis-renders the underline on the following character:
But if I copy the resulting text and paste it onto Stack Overflow, it seems to render correctly (Chrome on Linux):
aaau̲zzz
Unless I format it as code:
aaau̲zzz
In which case it doesn't "combine" at all.
Here's a screenshot of the above, in case your browser renders it differently:
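If you want to check what Python itself knows about U+0332, the standard `unicodedata` module can report its properties. This is only an inspection sketch; it says nothing about how a given terminal or browser will draw the mark.

```python
import unicodedata

mark = "\u0332"
print(unicodedata.name(mark))       # official character name
print(unicodedata.category(mark))   # "Mn" means nonspacing mark
print(unicodedata.combining(mark))  # nonzero for combining characters

s = "aaau\u0332zzz"
# The mark is a separate code point, so the string is one longer than it looks.
print(len(s))
```

Because the mark is its own code point, width calculations based on `len()` will be off by one for every underlined character, which may be part of why terminals struggle with it.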

How can I find the character code of a special character in my text editor?

When pasting text from outside sources into a plain-text editor (e.g. TextMate or Sublime Text 2) a common problem is that special characters are often pasted in as well. Some of these characters render fine, but depending on the source, some might not display correctly (usually showing up as a question mark with a box around it).
So this is actually 2 questions:
1. Given a special character (e.g., ’ or ♥) can I determine the UTF-8 character code used to display that character from inside my text editor, and/or convert those characters to their character codes?
2. For those "extra-special" characters that come in as garbage, is there any way to figure out what encoding was used to display that character in the source text, and can those characters somehow be converted to UTF-8?
My favorite site for looking up characters is fileformat.info. They have a great Unicode character search that includes a lot of useful information about each character and its various encodings.
If you see the question mark with a box, that means you pasted something that can't be interpreted, often because it's not legal UTF-8 (not every byte sequence is legal UTF-8). One possibility is that it's UTF-16 with an endian mode that your editor isn't expecting. If you can get the full original source into a file, the file command is often the best tool for determining the encoding.
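If a scripting language is at hand, the same lookup can be done locally. The sketch below uses Python's standard `unicodedata` module; the function name `describe` is made up for the example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point, Unicode name, and UTF-8 bytes of one character."""
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch, "<unnamed>")
    utf8_hex = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    return f"{code_point} {name} (UTF-8: {utf8_hex})"


# The two example characters from the question.
for ch in "\u2019\u2665":
    print(describe(ch))
```

This only works for text that was already decoded correctly; for garbage characters, the original bytes are needed first.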
At &what I built a tool to focus on searching for characters. It indexes all the Unicode and HTML entity tables, but also supplements with hacker dictionaries and a database of keywords I've collected, so you can search for words like heart, quot, weather, umlaut, hash, cloverleaf and get what you want. By focusing on search, it avoids having to hunt around the Unicode pages, which can be frustrating. Give it a try.

How do I create special characters used in XML in my C# code

I have the following string, read from an XML attribute:
"OnTrak 4-3/4&#8221;, 6-3/4&#8221;, 8-1/4&#8221; / MPR"
In my C# application it shows up nicely formatted like this
"OnTrak 4-3/4”, 6-3/4”, 8-1/4” / MPR"
This is the form I see in the debugger, a combobox, or on this forum (if I don't indent to specify code).
What I want to do is specify the same string as a C# variable and have it show up nicely formatted when the application runs. Unfortunately, all I get is the string as I literally typed it.
I have tried to play around with converting the encoding from ASCII to UTF8 with no luck. How can I get this special character properly formatted, and where can I find a list of these symbols?
Those are called XML entities. Use HttpUtility.HtmlDecode to decode them back to the plain text you want. Credit goes to the question "C#, function to replace all html special characters with normal text characters" for how to convert entities in C#.
Note that converting from ASCII to UTF-8 (or another Unicode encoding) means changing the character encoding, and is only needed when the string contains characters the current encoding cannot represent. For instance, if your strings contained Chinese characters you couldn't use ASCII. In this simple case you shouldn't need to convert anything, because C# strings are Unicode by default and XML entities are Unicode based (I believe).
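For illustration, the same decoding step can be sketched in Python, where the standard library's `html.unescape` plays a role similar to HttpUtility.HtmlDecode. This is an analogue, not the C# API, and it assumes the attribute holds a numeric character reference such as &#8221; for the right double quotation mark (U+201D).

```python
import html

# Assumed raw form of the attribute value, with the quote as a numeric reference.
raw_attribute = "OnTrak 4-3/4&#8221;, 6-3/4&#8221;, 8-1/4&#8221; / MPR"

decoded = html.unescape(raw_attribute)
print(decoded)

# In source code the character can also be written directly with an escape,
# so no decoding step is needed at all.
literal = "OnTrak 4-3/4\u201d, 6-3/4\u201d, 8-1/4\u201d / MPR"
print(literal == decoded)
```

The second half shows the other option: when the string lives in source code rather than XML, a \u escape (which C# string literals also support) produces the character without any entity decoding.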

Why does question mark show up in web browser?

I was (re)reading Joel's great article on Unicode and came across this paragraph, which I didn't quite understand:
For example, you could encode the Unicode string for Hello (U+0048
U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding,
or the Hebrew ANSI Encoding, or any of several hundred encodings that
have been invented so far, with one catch: some of the letters might
not show up! If there's no equivalent for the Unicode code point
you're trying to represent in the encoding you're trying to represent
it in, you usually get a little question mark: ? or, if you're really
good, a box. Which did you get? -> �
Why is there a question mark, and what does he mean by "or, if you're really good, a box"? And what character is he trying to display?
There is a question mark because the encoding process recognizes that the encoding can't support the character, and substitutes a question mark instead. By "if you're really good," he means, "if you have a newer browser and proper font support," you'll get a fancier substitution character, a box.
In Joel's case, he isn't trying to display a real character; he literally included U+FFFD REPLACEMENT CHARACTER.
It’s a rather confusing paragraph, and I don’t really know what the author is trying to say. Anyway, different browsers (and other programs) have different ways of handling problems with characters. A question mark “?” may appear in place of a character for which there is no glyph in the font(s) being used, so that it effectively says “I cannot display the character.” Browsers may alternatively use a small rectangle, or some other indicator, for the same purpose.
But the “�” symbol is the REPLACEMENT CHARACTER, which is normally used to indicate a data error, e.g. when character data has been converted from some encoding to Unicode and it contained some character that cannot be represented in Unicode. Browsers often use “�” in display for a related purpose: to indicate that character data is malformed, containing bytes that do not constitute a character in the character encoding being applied. This often happens when data in some encoding is being handled as if it were in some other encoding.
So “�” does not really mean “unknown character”, still less “undisplayable character”. Rather, it means “not a character”.
A question mark appears when a byte sequence in the raw data does not match the data's character set, so it cannot be decoded properly. That happens if the data is malformed, if the data's charset is explicitly stated incorrectly in the HTTP headers or the HTML itself, if the charset is guessed incorrectly by the browser when other information is missing, or if the user's browser settings override the data's charset with an incompatible charset.
A box appears when a decoded character does not exist in the font that is being used to display the data.
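Both substitutions are easy to reproduce outside a browser. The sketch below uses Python's `errors="replace"` handler, which is just one convenient way to show the two cases; browsers have their own logic, and the sample strings are made up.

```python
# Case 1: encoding to a charset that lacks the characters.
# Each unsupported character becomes a plain "?".
greek = "\u03b1\u03b2\u03b3"  # three Greek letters: alpha, beta, gamma
as_ascii = greek.encode("ascii", errors="replace")
print(as_ascii)

# Case 2: decoding bytes with the wrong charset.
# Latin-1 encodes e-acute as the single byte 0xE9, which is not valid UTF-8 here.
latin1_bytes = "caf\u00e9".encode("latin-1")
decoded = latin1_bytes.decode("utf-8", errors="replace")
print(decoded)  # the bad byte becomes U+FFFD REPLACEMENT CHARACTER
```

The first case loses information at encode time, while the second is a decode-time signal that the bytes did not form a character in the assumed encoding.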
Just what it says - some browsers show "a weird character" or a question mark for characters outside of the current known character set. It's their "hey, I don't know what this is" character. Get an old version of Netscape, paste some text from Microsoft Word that uses smart quotes, and you'll get question marks.
http://blog.salientdigital.com/2009/06/06/special-characters-showing-up-as-a-question-mark-inside-of-a-black-diamond/ has a decent explanation.
