Windows to UTF-8 Character Encoding Behaviour Query - Linux

A simple query about expected behaviour when compiling Windows-1252 source under UTF-8. When building Java source code with an Ant task, some odd character encoding occurs.
For certain fields, characters that are normally encoded as \u2013 on the Windows machine, for example, turn into \226 on Linux. What is the explanation for the \226? Will it still be rendered correctly in a browser, for example?
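The likely explanation: \226 is an octal escape. Octal 226 is decimal 150, i.e. byte 0x96, which is the Windows-1252 encoding of the en dash that Unicode calls U+2013. The byte survived, but the build read the Windows-1252 source under a different default charset (setting the encoding attribute on the Ant javac task is the usual fix). A small C-family sketch, just to make the correspondence concrete:

#include <cstdio>

int main() {
    // '\226' is an octal escape: 2*64 + 2*8 + 6 = 150 = 0x96.
    // In Windows-1252, byte 0x96 is the en dash, the character
    // Unicode (and the Windows build) writes as \u2013.
    unsigned char c = '\226';
    std::printf("octal \\226 = decimal %d = hex 0x%X\n", c, (unsigned)c);
    return 0;
}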

Related

Possible to force CMake/MSVC to use UTF-8 encoding for source files without a BOM? C4819

All our source code is valid UTF-8; however, some users on Windows cannot build it because their systems are configured for a different encoding.
Without adding a BOM to source files, is it possible to tell MSVC to treat all source as UTF-8, irrespective of the user's system encoding?
See MSDN's page on this topic (the approach there requires adding a BOM header).
You can try:
add_compile_options("$<$<C_COMPILER_ID:MSVC>:/utf-8>")
add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>")
By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you have specified a code page by using /utf-8 or the /source-charset option.
References
Docs - Visual C++ - Documentation - IDE and Tools - Building - Build Reference: /utf-8 (Set Source and Executable character sets to UTF-8)
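To make the quoted behaviour concrete, here is a hedged sketch: a file saved as UTF-8 without a BOM, compiled with and without /utf-8 (the specific mojibake assumes a Windows-1252 system codepage):

#include <cstdio>

int main() {
    // This file is saved as UTF-8 without a BOM; the "ü" below is
    // the two bytes 0xC3 0xBC.
    // Without /utf-8, MSVC on a Windows-1252 system decodes those
    // two bytes as the two characters "Ã¼", so the literal silently
    // becomes mojibake (and C4819 fires for bytes that have no
    // mapping in the current code page).
    // With /utf-8, the file is decoded as UTF-8 and the literal
    // round-trips correctly.
    const char* s = "Grüße";
    std::printf("%s\n", s);
    return 0;
}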
If you write cross-platform code, solving the problem with a command-line switch, i.e. the add_compile_options lines above, or adding something like /utf-8 or /source-charset to the CFLAGS, might mean you'll have to do a similar thing for the other platforms as well.
If possible, it might therefore be better to avoid the problem instead of solving it, by using a \uxxxx escape instead of a raw Unicode character in strings: this way the source specifies which Unicode characters to use, but doesn't actually contain them. A sketch of that approach follows.
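For instance, a minimal sketch using a universal character name, so the file itself stays pure ASCII (U+00C5 is chosen because it is representable in the common execution character sets; characters outside the execution charset would need a wide or UTF-8 literal):

#include <cstdio>

int main() {
    // U+00C5 (Latin capital A with ring, "Å") spelled as a universal
    // character name: the source file contains only ASCII bytes, so
    // no source-charset guessing can corrupt it.
    const char* a_ring = "\u00c5";
    std::printf("%s\n", a_ring);
    return 0;
}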

D language fails to display German umlauts on Windows?

As you can see, D fails to output German umlauts, at least on Windows. On Linux or BSD the same program outputs the string as I saved it.
I already tried wstring and dstring, but the output is the same.
What am I doing wrong?
D will output UTF-8 regardless of the operating system. How the output is interpreted depends on how it is displayed. In this particular case, it looks like your IDE is interpreting the output as if it were encoded in Windows-1252.
For the standard Windows console, you could change the output encoding by calling SetConsoleOutputCP(65001), but note that this may have some undesired side effects (you should restore the codepage before your program exits, and batch files may not run while the console output codepage is set to 65001).
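A minimal Windows-only sketch of that call, including the restore the answer recommends (the console font must also contain the glyphs):

#include <windows.h>
#include <cstdio>

int main() {
    UINT previous = GetConsoleOutputCP(); // remember the current codepage
    SetConsoleOutputCP(CP_UTF8);          // 65001 = UTF-8

    // UTF-8 bytes for the German umlauts "äöü".
    std::puts("\xC3\xA4\xC3\xB6\xC3\xBC");

    SetConsoleOutputCP(previous);         // restore before the program exits
    return 0;
}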
CyberShadow's post guided me to an acceptable answer. :-)
In Eclipse it is possible to change the output encoding without changing the global settings of the OS.
Go to Run --> Run Configurations...
There, select the Common tab and change the encoding to UTF-8. Now German umlauts are displayed correctly. At least in Eclipse. :-)
Another possibility is to use https://babun.github.io/ . It is a Cygwin-based shell that outputs UTF-8.

wxFormBuilder and Unicode labels

Is there a way to get Unicode characters into label code generated by wxFormBuilder?
For example, to get an Angstrom character the generated string should read u"\u212b".
I tried entering \u212b in the label property field but the resulting string reads u"u212b". So I tried escaping the backslash as \\u212b but that gave me u"\\u212b".
I'm using wxFormBuilder v3.5 - beta, generating Python code, although the C++ code shows the same behaviour.
By default, wxFormBuilder puts the declaration # -*- coding: utf-8 -*- on the first line, at least for the generated Python code.
So I went into MS Word and inserted the Angstrom character Å, then copied it into a wxFormBuilder (version 3.5 - RC1) static text control, and it worked when running the code.
Try my approach above instead of typing \u212b into the property field. Or type the character directly in your code, like so: u"Hello... Å". A C++ sketch of the escape-in-code route follows.
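Since the question notes the generated C++ shows the same behaviour, here is a hedged C++ sketch of writing the label in code (AddAngstromLabel is a hypothetical helper, not wxFormBuilder output):

#include <wx/stattext.h>

// Create the label in code instead of via the property field,
// spelling the Angstrom sign as an escape so the source stays ASCII.
void AddAngstromLabel(wxWindow* parent) {
    new wxStaticText(parent, wxID_ANY, wxString(L"\u212b"));
}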

Non-ASCII string literal is escaped after build

Please help me solve a problem that appeared recently.
When the release project is built on the build machine (using MSBuild), all string literals in the code are escaped as \x00NN, where NN are two hex digits. The problem is that when such values are displayed in a form (WinForms), they appear with broken encoding (like a wrong codepage on the web).
In the source code it looks like
str = " Без ПДВ"
but reflector shows
str = " \x00c1\x00e5\x00e7 \x00cf\x00c4\x00c2";
And this appears as a string with broken encoding in the form, like
â ò.÷. ÏÄÂ
What causes MSBuild to convert non-ASCII string literals into escaped symbols? There is no such problem for dev builds on the developers' machines.
Regional settings were checked for the user that runs MSBuild and were changed from German to Ukrainian; the same was done for the language for non-Unicode programs. It did not help, even after a reboot.
MSBuild had worked without this problem on the same machine for a year, but the latest build breaks string literals in the code.
The command line looks like
MSBuild {LocalPath}{Solution} /property:DefineConstants="{Defines}{DefinesExtra}" /t:{Target} /property:Configuration={Configuration} {Platform} /clp:NoItemAndPropertyList
Target is Build (or Rebuild; it does not matter), configuration is Release, platform is x86.
PS: I know it is bad to store localized strings in code (but shit happens).
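The escapes themselves point at the likely cause. In Windows-1251, "Без" is the bytes 0xC1 0xE5 0xE7; decoded under Windows-1252 (the German ANSI codepage) those same bytes become Á å ç, i.e. U+00C1 U+00E5 U+00E7, exactly the \x00c1\x00e5\x00e7 that reflector shows (and ПДВ, bytes 0xCF 0xC4 0xC2, likewise becomes ÏÄÂ). So the build machine is most likely reading a Windows-1251-encoded source file under codepage 1252; saving the file as UTF-8 with a BOM, or passing an explicit codepage to the compiler, is the usual fix. A minimal Windows-only sketch of the misread, under the assumption that the source is saved in Windows-1251:

#include <windows.h>
#include <cstdio>

int main() {
    // "Без" encoded in Windows-1251 (Cyrillic): bytes 0xC1 0xE5 0xE7.
    const char bytes[] = "\xC1\xE5\xE7";
    wchar_t wide[8] = {};

    // Read as Windows-1251: U+0411 U+0435 U+0437, i.e. "Без".
    MultiByteToWideChar(1251, 0, bytes, -1, wide, 8);
    std::printf("cp1251: %04X %04X %04X\n",
                (unsigned)wide[0], (unsigned)wide[1], (unsigned)wide[2]);

    // Read as Windows-1252: U+00C1 U+00E5 U+00E7, i.e. "Áåç",
    // matching the \x00NN escapes in the decompiled binary.
    MultiByteToWideChar(1252, 0, bytes, -1, wide, 8);
    std::printf("cp1252: %04X %04X %04X\n",
                (unsigned)wide[0], (unsigned)wide[1], (unsigned)wide[2]);
    return 0;
}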

Questions on Chinese Encoding

I'm trying to create a webpage in Chinese and I realized that while the text looks fine when I run it on browsers, once I change the Character Encoding, the text becomes gibberish. Here's what's happening:
I create my html file in Emacs, encoded in UTF-8.
I upload it to the server, and view it on my browsers (FF, IE, Chrome, Opera) - no problem.
I try to view the page in other encodings via FF > View > Character Encoding > All those different Chinese encoding systems, e.g. Chinese Simplified (HZ)
Apart from UTF-8, on every other encoding the text becomes gibberish.
I'm assuming this isn't a problem, i.e. browsers are smart enough to know which encoding the page is in and parse the content accurately. What I'm wondering is why I can't read the Chinese text anymore once I change the encoding: is it because I don't have Chinese fonts installed on my OS? Should I stick to UTF-8 if my audience is Chinese, or should I choose one of their many encoding systems?
Thanks in advance for your help/opinions.
UTF-8 isn't a 'catch-all' encoding. It's designed to contain international language character symbols for ease of use, but it is still an encoding, just like the other encodings you've selected. You would have to re-encode the text in each encoding to make it appear correctly when viewed with that encoding.
The viewer's encoding MUST match the file being read. Viewing UTF-8 as something else makes about as much sense as renaming .txt to .exe and trying to run it.
You should specify the correct encoding in the HTML itself, e.g. with <meta charset="utf-8">. The option in your web browser exists only for those rare occasions when the web developer screwed up his job and declared a different encoding than the one actually used, OR mixed up two different encodings on one page.
Of course changing the encoding in your browser will "break" the text! The browser takes the stream of UTF-8 bytes and tries to force another encoding onto the raw data. Needless to say, the result ain't pretty. Changing the encoding in the browser is NOT the equivalent of converting.
As you surmised correctly, modern browsers usually guess correctly, but not always. As Agent_L said, make sure to declare the encoding in the headers.
