Unicode char issues in some representations - browser

Can anyone explain why only the first character is fragmented in this example?
<!DOCTYPE html>
<html>
<body>
<span>𝅘𝅥𝅮</span>
<span>𝅘𝅥𝅮</span>
<span>𝅘𝅥𝅮</span>
</body>
</html>
Tested with Chrome 77 and Firefox 70, on Windows and Linux.
This is what I see:

It seems you have hit a bug (possibly in the font).
According to the Unicode Standard, chapter 21, figure 21-2, this is exactly your case:
U+1D160 = U+1D158 + U+1D165 + U+1D16E
so it should be displayed like your third character (how the second character appears depends on the encoding of the page).
The font correctly puts the last two code points together, but it puts too much distance between the first and the second. This seems contrary to the Unicode standard, so I can only assume it is a bug.
You may try other fonts (possibly webfonts) and force their use for such characters; with a webfont you can be sure all users will see the same style. With a short search, though, I could not find a good free webfont with musical notes.
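If you want to verify the decomposition claimed above, a minimal Python sketch (assuming a Python 3 interpreter; only the standard library is used) normalizes the precomposed eighth note to NFD and prints the resulting code points:
import unicodedata
# U+1D160 MUSICAL SYMBOL EIGHTH NOTE, the precomposed form
note = "\U0001D160"
# Canonical decomposition (NFD) should yield the three-code-point sequence
# U+1D158 (notehead black) + U+1D165 (combining stem) + U+1D16E (combining flag-1)
for ch in unicodedata.normalize("NFD", note):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")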


Web pages served by local IIS showing black diamonds with question marks

I'm having an issue in a .NET application where pages served by local IIS display random characters (mostly black diamonds with white question marks in them). This happens in Chrome, Firefox, and Edge. IE displays the pages correctly for some reason.
The same pages in production and in lower pre-prod environments work in all my browsers. This is strictly a local issue.
Here's what I've tried:
Deleted code and re-cloned (also tried switching branches)
Disabled all browser extensions
Ran in incognito mode
Rebooted (you never know)
Deleted temporary ASP.NET files
Looked for corrupt fonts on machine but didn't find any
Other Information:
Running IIS 10.0.17134.1
.NET MVC application with Knockout
I realize there are several other posts regarding black diamonds with question marks, but none of them seem to address my issue.
Please let me know if you need more information.
Thanks for your help!
You are in luck. The explicit purpose of � is to indicate that character encodings are being misused. When users see that, they'll know that we've messed up and lost some of their text data, and we'll know that, at one or more points, our processing and/or configuration is wrong.
(Fonts are not at issue [unless there is no font available to render �]. When there is no font available for a character, it's usually rendered as a white-filled rectangle.)
Character encoding fundamentals are simple: use a sufficient character set (say Unicode), pick an appropriate encoding (say UTF-8), encode text with it to obtain bytes, tell every program and person that gets the bytes that they represent text and which encoding is used. The encoding might be understood from a standard, convention, or specification.
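To make those fundamentals concrete, here is a minimal Python 3 sketch (the sample string is arbitrary) showing that bytes only come back as the original text when decoded with the same encoding that produced them, and that a wrong decoding yields mojibake or the U+FFFD replacement character:
text = "Résumé сойдя"
data = text.encode("utf-8")              # encode with the intended encoding to get bytes
assert data.decode("utf-8") == text      # decoding with the same encoding recovers the text
print(data.decode("windows-1252", errors="replace"))  # wrong encoding: mojibake such as "RÃ©sumÃ©", plus � for undefined bytes
print(data.decode("ascii", errors="replace"))         # encoding that cannot represent the bytes: mostly �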
Your editor does the actual encoding.
If the file is part of a project or similar system, a project file might store the intended encoding for all or each text file in the project. If your editor is an IDE, it should understand how the project does that.
Your compiler needs to know the encoding of each text file you give it. A project system would communicate what it knows.
HTML provides an optional way to communicate the encoding. Example: <meta charset="utf-8">. An HTML-aware editor should not allow this indicator to be different than the encoding it uses when saving the file. An HTML-aware editor might discover this indicator when opening the file and use the specified encoding to read the file.
HTTP uses another optional way: Content-Type response header. The web server emits this either statically or in conjunction with code that it runs, such as ASP.NET.
Web browsers use the HTTP way if given.
XHR (AJAX, etc) uses HTTP along with JavaScript processing. If needed the JavaScript processing should apply the HTTP and HTML rules, as appropriate. Note: If the content is JSON, the current RFC requires the encoding to be UTF-8.
No one or thing should have to guess.
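As a quick scripted check of what a server actually declares, here is a minimal Python 3 sketch using only the standard library (the URL is a placeholder for your local IIS site):
import urllib.request
# Fetch the page and print the declared Content-Type, e.g. "text/html; charset=utf-8"
with urllib.request.urlopen("http://localhost/index.html") as resp:
    print(resp.headers.get("Content-Type"))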
Diagnostics
Which character encoding did you intend to use? This century, UTF-8 is so much the norm that if you choose to use a different one, you should have a good reason and document it (for others and your future self).
Compare the bytes in the file with the text you expect it to represent. Does it use the intended encoding? Use an editor or tool that shows bytes in hex (or the sketch after this list).
As suggested by @snakecharmerb, what does the server send? Use a web browser's F12 network tab.
What does the HTTP response header say, if anything?
What does the HTML meta tag say, if anything?
What is the HTML doctype, if any?
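For the byte-level check above, a hex editor works, but a short Python sketch does the same job (Python 3.8+ for bytes.hex with a separator; "page.html" is a placeholder file name):
# Dump the first bytes of the file in hex to compare against the intended encoding.
with open("page.html", "rb") as f:
    head = f.read(64)
print(head.hex(" "))   # e.g. UTF-8 "é" shows as "c3 a9"; a UTF-8 BOM as "ef bb bf"
print(head)            # the bytes repr also shows plain ASCII readably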

How to convert "binary text" to "visible text"?

I have a text file full of non-ASCII characters.
I cannot detect the encoding with either file or enca.
file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text
enca non_ascii.txt
Unrecognized encoding
But I can open it normally in Windows Notepad++
Edit: the statement above is misleading, sorry about that.
In fact, I picked some parts of the original file and put them into a new text file, then opened it in Notepad++.
The two parts are shown below; Notepad++ decodes them in two different ways.
Question:
How can I detect the file's encoding under Linux?
How do I recover the characters represented by <F1><EE><E9><E4><FF>?
I couldn't get a result from grep 'сойдя' win.txt even though "сойдя" is encoded as <F1><EE><E9><E4><FF>.
The file content slice as follows:
less non_ascii.txt
"non_ascii.txt" may be a binary file. See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>
Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?
The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
With that out of the way, converting is easy.
iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should work on utf8.txt now.
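If you prefer to do the conversion in Python rather than iconv, a minimal sketch (assuming Python 3 and the file names used above):
# Read the bytes as CP1251 and write them back out as UTF-8,
# equivalent to: iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
with open("non_ascii.txt", "r", encoding="cp1251") as src, \
     open("utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(src.read())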
The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.
Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.
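As a sketch of that kind of guessing, the third-party chardet package (installed separately, e.g. pip install chardet) reports its best guess together with a confidence score; treat the result as a hint, not a fact:
import chardet  # third-party: pip install chardet
with open("non_ascii.txt", "rb") as f:
    raw = f.read()
print(chardet.detect(raw))  # e.g. {'encoding': 'windows-1251', 'confidence': 0.87, 'language': 'Russian'}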

Why does a pointer show a Unicode string reversed?

I have problems with Unicode strings. My pointer to a string in Farsi (saved as Unicode, codepage 1200) returns the string reversed. Why? I know that Farsi is a right-to-left language, but this is a C/C++ matter. My pointer to a string should point to the start of the sequence as it is stored in the file.
I'm using VC++2005, standard console app.
Any help will be welcome; I have attached a screenshot and a sample project.
test project
screen capture
Regards,
Juan
If the order is reversed in VC++ 2005, then probably it just does not handle directionality right, i.e. it displays Arabic-script letters left to right instead of properly obeying their inherent right-to-left directionality. Such things happen in many editors and development tools. It does not, in itself, affect the behavior of your application.
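A quick way to convince yourself that the string is stored in logical order (first letter first) regardless of how it is drawn is to print the individual code points; the same holds for a wchar_t buffer in C++. A minimal Python sketch, with an arbitrary Farsi word as the example:
import unicodedata
word = "سلام"  # stored in logical order: the first letter typed comes first in memory
for i, ch in enumerate(word):
    print(i, f"U+{ord(ch):04X}", unicodedata.name(ch))
# The indices run in logical (typing) order; only the on-screen rendering is right-to-left.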

Special characters show correctly in Windows, but as question marks in Linux

Why is it that letters like š, č, ž are displayed correctly in Windows but as question marks in Linux? I'm using UTF-8 encoding:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
I have also saved the file as a UTF-8 file. Could it be because I created and edited the file in Windows?
Since you're using HTML, here are some plain HTML character references for Slavic characters.
If that doesn't solve the problem: I've seen this issue in SQL development, and JimR is half-correct; it is the character, rather than the font, that is not supported.
Your font doesn't matter so much as your character set, i.e. whether or not the character exists in your context. The real question becomes, "Does your current Linux environment support that character in its character set?"
If you're not sure, try exchanging UTF-8 for ISO/IEC 8859-16. A bit more heavy-handed, and some extra coding if you need the other characters of UTF-8, but since it's the standard character set for the region around Slovakia, there shouldn't be a reason it wouldn't work.
Should that fail, then the issue is clearly with your Linux distribution to some degree.
Best of luck!
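If you want to check whether the characters themselves are representable, a minimal Python 3.8+ sketch shows that š, č and ž encode fine in UTF-8 and in ISO 8859-2/-16, just to different bytes, which is why the declared charset must match the bytes actually saved:
chars = "ščž"
for enc in ("utf-8", "iso8859-2", "iso8859-16"):
    print(enc, chars.encode(enc).hex(" "))
# An encoding that could not represent one of these characters would raise UnicodeEncodeError instead.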

Questions on Chinese Encoding

I'm trying to create a webpage in Chinese and I realized that while the text looks fine when I run it on browsers, once I change the Character Encoding, the text becomes gibberish. Here's what's happening:
I create my html file in Emacs, encoded in UTF-8.
I upload it to the server, and view it on my browsers (FF, IE, Chrome, Opera) - no problem.
I try to view the page in other encodings via FF > View > Character Encoding > All those different Chinese encoding systems, e.g. Chinese Simplified (HZ)
Apart from UTF-8, on every other encoding the text becomes gibberish.
I'm assuming this isn't a problem - i.e. browsers are smart enough to know which encoding the page is in, and parse the content accurately. What I'm wondering is why I can't read the Chinese text anymore once I change encoding - is it because I don't have Chinese fonts installed on my OS? Should I stick to UTF-8 if my audience are Chinese or should I choose among one of their many encoding systems?
Thanks in advance for your help/opinions.
UTF-8 isn't a 'catch-all' encoding. It is designed to cover the characters of every language, but it is still an encoding, just like the other encodings you've selected. You would have to re-save the text in each encoding to make it appear correctly when viewed with that encoding.
Viewer encoding MUST match the file being read. Viewing UTF-8 as something else makes about as much sense as renaming .txt to .exe and trying to run it.
You should specify the correct encoding in the HTML. The option you're using in the web browser exists only for those rare occasions when a web developer screwed up and declared a different encoding than was actually used, OR mixed up two different encodings on one page.
Of course changing the encoding in your browser will "break" the text! The browser takes the stream of UTF-8 bytes and tries to force another encoding onto the raw data. Needless to say, the result isn't pretty. Changing the encoding in the browser is NOT the equivalent of converting.
As you correctly surmised, modern browsers usually guess correctly, but not always. As Agent_L said, make sure to declare the encoding in the headers.
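To see why forcing another encoding onto UTF-8 bytes cannot work, here is a minimal Python sketch (the sample text is arbitrary; GB 2312 stands in for the browser's "Chinese Simplified" entries):
text = "中文网页"                        # arbitrary sample text
utf8_bytes = text.encode("utf-8")        # the bytes the server actually sends
# What the browser's encoding menu does: reinterpret the same bytes as GB 2312.
print(utf8_bytes.decode("gb2312", errors="replace"))  # gibberish and/or U+FFFD
# What converting actually means: decode with the real encoding, then re-encode.
gb_bytes = utf8_bytes.decode("utf-8").encode("gb2312")
print(gb_bytes.decode("gb2312"))         # the original text again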
