How was this invisible space created? - string

'{FileTitle​}' === '{FileTitle}'
// false
There's a space between the first string's last e and }
'{FileTitle​}'.length
// 12
'{FileTitle}'.length
// 11

There is a Unicode character with code 8203 (U+200B) between those two characters. This is a zero-width space. Have a look at the corresponding Wikipedia article for more info.

This is a great example of a sometimes nasty problem :-)
If I copy your code to TextWrangler, then I see the space. If I choose "Hex Dump", then I see the hex bytes 0B 20. Considering the little-endian context (thanks to @axiac), this means the character 0x200B, decimal 8203.
For information about specific Unicode characters, use this: http://unicode-table.com/de/search/?q=8203 You'll see the description "Zero Width Space".
As for how this character got into your code, one can only guess. Option one: you typed it in your editor unwittingly by hitting a certain key combination. Option two: you copied it from a rich text document as a stowaway. Option three: it got there because of some botched multibyte string operation.
A related problem is U+00A0 (byte 0xA0 in Latin-1, which is outside ASCII), the non-breaking space. It cannot be distinguished from a normal space by eye, but it can cause compiler syntax errors that are sometimes hard to track down.
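One way to hunt for such stowaways is to scan a string for code points that are easy to miss by eye. The following is only a sketch: the helper name findInvisible and the list of suspect characters are assumptions made for illustration, and the list is not exhaustive.

```javascript
// Hypothetical helper: report code points that are easy to miss by eye.
// The table below is illustrative, not a complete list.
const SUSPECTS = new Map([
  [0x00a0, 'NO-BREAK SPACE'],
  [0x200b, 'ZERO WIDTH SPACE'],
  [0x200c, 'ZERO WIDTH NON-JOINER'],
  [0x200d, 'ZERO WIDTH JOINER'],
  [0x200e, 'LEFT-TO-RIGHT MARK'],
  [0x200f, 'RIGHT-TO-LEFT MARK'],
  [0xfeff, 'ZERO WIDTH NO-BREAK SPACE (BOM)'],
]);

function findInvisible(text) {
  const hits = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (SUSPECTS.has(code)) {
      hits.push({ index: i, code, name: SUSPECTS.get(code) });
    }
  }
  return hits;
}

// The zero-width space is written as an escape so it stays visible here.
const suspicious = '{FileTitle\u200B}';
console.log(suspicious.length);         // 12
console.log(findInvisible(suspicious)); // one hit: index 10, code 8203
```

Writing the character as \u200B in source code, rather than pasting it, keeps the difference between the two strings visible in any editor.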

This one has no space
'{FileTitle}' === '{FileTitle}'
You just used a different encoding.

Related

How to get unicode of characters from 55296 to 56319 in Excel

I generated a list of letters in Excel, from character codes 1 to 66535.
I am trying to get the Unicode value back by using the function "UNICODE". However, Excel returns #VALUE! for character codes from 55296 to 56319.
Please advise if there is any other function that can return the proper Unicode values.
Thank you.
The range you are listing is a special range in Unicode: surrogates.
So, they have Unicode code points, but the problem is that you cannot have them in text on their own: Windows uses UCS-2/UTF-16 as its internal encoding, so there is no way to put a lone surrogate in text. Or better: to represent code points above 65535, Windows uses two surrogates, one in the range 0xD800-0xDBFF (high surrogate) and the second in the range 0xDC00-0xDFFF (low surrogate). By combining these two, you get all Unicode code points.
So you should never have a single surrogate (or a mismatched surrogate, e.g. a high surrogate not followed by a low surrogate, or a low surrogate not preceded by a high surrogate).
So, just skip such codes. Or better, use them correctly to get characters above 65535.
Note: you cannot get all Unicode characters with only one code point each. Many characters require combining several code points (there is a whole category of "combining characters" in Unicode). E.g. the zero with an oblique line is rendered with two code points: the normal zero and a variation selector. Precomposed accented characters are also very limited (and often have just one accent per character). And that is without going into more complex scripts.
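The surrogate arithmetic described above can be sketched in a few lines. This is an illustration only, not an Excel formula: the function names toSurrogates and fromSurrogates are made up for the example, and U+1F600 is just a convenient code point above 65535.

```javascript
// Split a code point above 0xFFFF into its UTF-16 surrogate pair.
function toSurrogates(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('expected a code point in 0x10000..0x10FFFF');
  }
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // lands in 0xD800..0xDBFF
  const low = 0xdc00 + (offset & 0x3ff); // lands in 0xDC00..0xDFFF
  return [high, low];
}

// Recombine a high and a low surrogate into the original code point.
function fromSurrogates(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const [high, low] = toSurrogates(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00
console.log(fromSurrogates(high, low) === 0x1f600); // true
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1f600)); // true
```

Codes 55296 to 56319 are exactly the high-surrogate range 0xD800-0xDBFF, which is why they have no meaning on their own.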

Vim: Utf-8 ې character breaks displayed string

I have a file with the hex content db90 3031 46, which should be displayed in Vim as "ې" followed by "01F", but I noticed that it is never displayed correctly. Then I noticed that it is the same in other places, such as the terminal and the browser: I always get ې01F. Why is that? Just paste it into Google and try it yourself; you will never be able to put "ې" with 0 as the next character.
That's an Arabic character with right-to-left directionality, so you probably need to switch back to left-to-right mode, such as with U+200E (LEFT-TO-RIGHT MARK).
The Unicode bidirectional stuff is rather complex - the behaviour you are seeing is probably caused by the fact that the Latin digits are marked EN = European number (a weak type), while letters such as F are marked L = left to right (a strong type).
Weak types are treated differently in the Unicode specification, such as with this quote which covers your particular case (my emphasis):
Problematic cases may occur when a right-to-left paragraph begins with left-to-right characters, or there are nested segments of different-direction text, or there are weak characters on directional boundaries. In these cases, embeddings or directional marks may be required to get the right display.
So your code point followed by a digit renders as "ې7" (I typed that 7 in after the Arabic character despite the fact it's showing up before it), while following it with a letter gives "ېX".
For what it's worth, the text "ې‎7" was generated here by inserting &#x200e; between the two characters, the HTML equivalent of the U+200E Unicode code point.
If you head on over to this UTF-8 codec site and enter %u06D0%u200e7 into the decoding section, you'll see that it comes out in your desired order (removing the %u200e shows it in the order you're describing in your question).
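The same idea can be tried in code. The sketch below assumes the character from the question is U+06D0 (which is what the bytes db 90 decode to in UTF-8) and uses a made-up helper, codes, to show that the stored order of the characters never changes; only the display order is affected by the mark.

```javascript
const arabicE = '\u06D0'; // ARABIC LETTER E, UTF-8 bytes db 90
const lrm = '\u200E';     // LEFT-TO-RIGHT MARK

const plain = arabicE + '01F';
const marked = arabicE + lrm + '01F';

// Hypothetical helper: list the code points of a string in stored order.
const codes = (s) =>
  Array.from(s, (c) => c.codePointAt(0).toString(16)).join(' ');

console.log(codes(plain));  // 6d0 30 31 46
console.log(codes(marked)); // 6d0 200e 30 31 46
```

How the two strings are drawn still depends on the terminal, editor, or browser doing the rendering, so the mark is a hint to the bidirectional algorithm rather than a guarantee.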

Why does this Excel string comparison fail?

Is it an Excel bug? Has anyone experienced this issue? Please help.
Just a thought but here's what MS says about TRIM
The TRIM function was designed to trim the 7-bit ASCII space character
(value 32) from text. In the Unicode character set, there is an
additional space character called the nonbreaking space character that
has a decimal value of 160. This character is commonly used in Web
pages as the HTML entity, &nbsp;. By itself, the TRIM function does
not remove this nonbreaking space character.
You might try this to replace the non-breaking space (if that is your problem here):
=TRIM(SUBSTITUTE(A5,CHAR(160),CHAR(32)))
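For readers who want to see the effect outside Excel, here is a rough JavaScript analogue of that formula. It is a sketch under assumptions: the function name normalizeSpaces and the sample text are invented for the example. Note one difference: JavaScript's built-in trim() already treats U+00A0 as whitespace, so the steps are spelled out by hand here to mirror what the formula does.

```javascript
// Rough analogue of =TRIM(SUBSTITUTE(A5,CHAR(160),CHAR(32)))
function normalizeSpaces(text) {
  return text
    .replace(/\u00A0/g, ' ') // SUBSTITUTE: no-break space -> ordinary space
    .replace(/ +/g, ' ')     // TRIM collapses runs of spaces to one
    .replace(/^ | $/g, '');  // TRIM drops leading and trailing spaces
}

// Hypothetical cell content pasted from a web page.
const fromWeb = 'Total\u00A0Sales\u00A0';
console.log(fromWeb === 'Total Sales');                  // false
console.log(normalizeSpaces(fromWeb) === 'Total Sales'); // true
```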
I would have to agree with @Jeeped. Your formula looks correct in all aspects. It must be a non-printing character. If this data is coming from some outside source (i.e. another file), then there very well could be a non-printing character. I just typed in everything you had manually and came up with this.

How can I find the character code of a special character in my text editor?

When pasting text from outside sources into a plain-text editor (e.g. TextMate or Sublime Text 2) a common problem is that special characters are often pasted in as well. Some of these characters render fine, but depending on the source, some might not display correctly (usually showing up as a question mark with a box around it).
So this is actually 2 questions:
Given a special character (e.g., ’ or ♥) can I determine the UTF-8 character code used to display that character from inside my text editor, and/or convert those characters to their character codes?
For those "extra-special" characters that come in as garbage, is there any way to figure out what encoding was used to display that character in the source text, and can those characters somehow be converted to UTF-8?
My favorite site for looking up characters is fileformat.info. They have a great Unicode character search that includes a lot of useful information about each character and its various encodings.
If you see the question mark with a box, that means you pasted something that can't be interpreted, often because it's not legal UTF-8 (not every byte sequence is legal UTF-8). One possibility is that it's UTF-16 with an endian mode that your editor isn't expecting. If you can get the full original source into a file, the file command is often the best tool for determining the encoding.
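If a JavaScript console is at hand, the code point and UTF-8 bytes of a pasted character can also be read off directly. This is a minimal sketch: describe is a made-up helper name, and it assumes an environment where TextEncoder is available (modern browsers and Node.js).

```javascript
// Hypothetical helper: show each character's code point and UTF-8 bytes.
function describe(text) {
  const encoder = new TextEncoder(); // TextEncoder always produces UTF-8
  return Array.from(text, (ch) => ({
    char: ch,
    codePoint:
      'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    utf8: Array.from(encoder.encode(ch), (b) =>
      b.toString(16).padStart(2, '0')
    ).join(' '),
  }));
}

// The two characters from the question, written as escapes.
console.log(describe('\u2019\u2665'));
// first entry:  codePoint 'U+2019', utf8 'e2 80 99'
// second entry: codePoint 'U+2665', utf8 'e2 99 a5'
```

This only helps with characters that survived the paste. Bytes that were already mangled on the way in need the original file and a tool such as the file command mentioned above.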
At &what I built a tool to focus on searching for characters. It indexes all the Unicode and HTML entity tables, but also supplements with hacker dictionaries and a database of keywords I've collected, so you can search for words like heart, quot, weather, umlaut, hash, cloverleaf and get what you want. By focusing on search, it avoids having to hunt around the Unicode pages, which can be frustrating. Give it a try.

Why does question mark show up in web browser?

I was (re)reading Joel's great article on Unicode and came across this paragraph, which I didn't quite understand:
For example, you could encode the Unicode string for Hello (U+0048
U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding,
or the Hebrew ANSI Encoding, or any of several hundred encodings that
have been invented so far, with one catch: some of the letters might
not show up! If there's no equivalent for the Unicode code point
you're trying to represent in the encoding you're trying to represent
it in, you usually get a little question mark: ? or, if you're really
good, a box. Which did you get? -> �
Why is there a question mark, and what does he mean by "or, if you're really good, a box"? And what character is he trying to display?
There is a question mark because the encoding process recognizes that the encoding can't support the character, and substitutes a question mark instead. By "if you're really good," he means, "if you have a newer browser and proper font support," you'll get a fancier substitution character, a box.
In Joel's case, he isn't trying to display a real character, he literally included the Unicode replacement character, U+FFFD REPLACEMENT CHARACTER.
It’s a rather confusing paragraph, and I don’t really know what the author is trying to say. Anyway, different browsers (and other programs) have different ways of handling problems with characters. A question mark “?” may appear in place of a character for which there is no glyph in the font(s) being used, so that it effectively says “I cannot display the character.” Browsers may alternatively use a small rectangle, or some other indicator, for the same purpose.
But the “�” symbol is the REPLACEMENT CHARACTER (U+FFFD), which is normally used to indicate a data error, e.g. when character data has been converted from some encoding to Unicode and it contained some character that cannot be represented in Unicode. Browsers often use “�” in display for a related purpose: to indicate that character data is malformed, containing bytes that do not constitute a character in the character encoding being applied. This often happens when data in some encoding is being handled as if it were in some other encoding.
So “�” does not really mean “unknown character”, still less “undisplayable character”. Rather, it means “not a character”.
A question mark appears when a byte sequence in the raw data does not match the data's character set, so it cannot be decoded properly. That happens if the data is malformed, if the data's charset is explicitly stated incorrectly in the HTTP headers or the HTML itself, if the charset is guessed incorrectly by the browser when other information is missing, or if the user's browser settings override the data's charset with an incompatible charset.
A box appears when a decoded character does not exist in the font that is being used to display the data.
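The decoding half of this can be reproduced with the standard TextDecoder API. The byte sequence below is an invented example: 0xFF can never appear in valid UTF-8, so a lenient decoder substitutes U+FFFD for it, while a strict one refuses the input.

```javascript
// 'H', then a byte that is never valid in UTF-8, then 'i'.
const bytes = new Uint8Array([0x48, 0xff, 0x69]);

// The default decoder is lenient: bad bytes become U+FFFD.
const lenient = new TextDecoder('utf-8');
const decoded = lenient.decode(bytes);
console.log(decoded === 'H\uFFFDi'); // true

// With fatal: true the same input is rejected instead.
const strict = new TextDecoder('utf-8', { fatal: true });
let rejected = false;
try {
  strict.decode(bytes);
} catch (err) {
  rejected = err instanceof TypeError;
}
console.log(rejected); // true
```

Whether the resulting U+FFFD is then drawn as a diamond with a question mark or as a box is a separate matter that depends on the font.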
Just what it says: some browsers show "a weird character" or a question mark for characters outside of the current known character set. It's their "hey, I don't know what this is" character. Get an old version of Netscape, paste some text from Microsoft Word that uses smart quotes, and you'll get question marks.
http://blog.salientdigital.com/2009/06/06/special-characters-showing-up-as-a-question-mark-inside-of-a-black-diamond/ has a decent explanation.
