Chinese conversion in MultiByteToWideChar - visual-c++

I'm trying to display Chinese text in MessageBoxW, but I can't correctly convert it from UTF-8 to wchar_t. At the same time, the original wchar_t Chinese string is displayed correctly.
I played with different MultiByteToWideChar flags, but got the same result. What is the reason for the incorrect conversion?

char text[] = "文本" contains UTF-8 bytes only if the source file itself is encoded in UTF-8. Since your title string displays correctly, your source file is saved in the default Chinese legacy encoding on Windows, so the text string contains bytes in that encoding, not UTF-8, and MultiByteToWideChar fails. You can see this: the function returns zero if you pass the flag that checks for invalid characters, which happens when the input isn't really UTF-8:
int ret = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, text, -1, wtext, 1000);
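As a side note (not part of the original answer), the same byte-level failure can be reproduced in a couple of lines of Python, assuming the default Chinese code page is GBK/cp936:

raw = "文本".encode("gbk")       # b'\xce\xc4\xb1\xbe' -- the legacy-encoded bytes in the source file
try:
    raw.decode("utf-8")          # these bytes are not valid UTF-8
except UnicodeDecodeError as err:
    print(err)                   # mirrors MultiByteToWideChar returning zero with MB_ERR_INVALID_CHARS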
The Microsoft compiler has options to specify source and execution character set, and a /utf-8 option (recommended):
/source-charset:<iana-name>|.nnnn set source character set
/execution-charset:<iana-name>|.nnnn set execution character set
/utf-8 set source and execution character set to UTF-8
There are multiple options to fix this. Options #2 and #3 assume the Microsoft compiler; other compilers may vary.
1. Use char text[] = u8"文本"; since your existing default encoding supports Chinese. With this notation, the source characters are interpreted in that encoding and then re-encoded as UTF-8. If the source is sent to someone with a different OS default encoding, it will not work for them.
2. Re-save the source as UTF-8 w/ BOM. The MS compiler will detect the BOM (a byte order mark used as a UTF-8 signature) and process the source as if /utf-8 was specified. text will contain UTF-8 bytes, and the title will display correctly.
3. Re-save as UTF-8 (no BOM) and compile with the /utf-8 switch to tell the compiler to decode the source as UTF-8 instead of the default encoding.
4. Use ASCII-only source and escape codes to specify the Chinese characters explicitly.
An example of #4 that will compile correctly no matter what the OS default encoding is:
#include <windows.h>
int main() {
    char text[] = "\xe6\x96\x87\xe6\x9c\xac";
    wchar_t wtext[1000];
    MultiByteToWideChar(CP_UTF8, 0, text, -1, wtext, 1000);
    MessageBoxW(NULL, wtext, L"\u6a19\u984c", MB_OK);
    return 0;
}

Related

How does VIM perform charset conversion?

I see the following paragraph in the Vim documentation introducing charset conversion:
Vim will automatically convert from one to another encoding in several places:
- When reading a file and 'fileencoding' is different from 'encoding'
- When writing a file and 'fileencoding' is different from 'encoding'
- When displaying characters and 'termencoding' is different from 'encoding'
- When reading input and 'termencoding' is different from 'encoding'
- When displaying messages and the encoding used for LC_MESSAGES differs from
'encoding' (requires a gettext version that supports this).
- When reading a Vim script where |:scriptencoding| is different from
'encoding'.
- When reading or writing a |viminfo| file.
I want to know which is converted to which. For example:
"When reading a file and 'fileencoding' is different from 'encoding'"
Is 'fileencoding' converted to 'encoding'? Or is 'encoding' converted to 'fileencoding'?
What is the relationship between the actual charset of the file and fileencoding and encoding?
If the actual charset of the file and the value of fileencoding are not equal, will the above conversion operations destroy the contents of the file?
UPDATE:
For example: the value of 'encoding' is utf-8, Vim opens a file foo and, based on 'fileencodings', detects a 'fileencoding' of sjis (assume I don't know the actual encoding of this file). I edit foo and use ":wq" to save and close the Vim window. If I open foo again, is the actual encoding of this file the sjis specified by 'fileencoding', or the utf-8 specified by 'encoding' when I last edited it?
'encoding' is the internal representation of any buffer text inside Vim; this is what Vim is working on. When you're dealing with different character sets (or if you don't care and work on a modern operating system), it's highly recommended to set this to utf-8, as the Unicode encoding ensures that any character can be represented and no information is lost. (And UTF-8 is the only Unicode representation that Vim internally supports; i.e. you cannot make it use a double-byte encoding like UTF-16.)
When you open a file in Vim, the list of possible encodings in 'fileencodings' (note the plural!) is considered:
This is a list of character encodings considered when starting to edit
an existing file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one
in the list is tried. When an encoding is found that works,
'fileencoding' is set to it.
So if a file doesn't look right, this is the option to tweak; alternatively, you can explicitly override the detection via the ++enc argument, e.g.
:edit ++enc=sjis japanese.txt
Now, Vim has the file's source encoding (persisted in (singular!) 'fileencoding'; this is needed for writing it back in the original encoding), and converts the character set (if different) to its internal 'encoding'. All Vim commands operate on that, and on :write, the conversion happens in reverse (or is optionally overridden by :w ++enc=...).
Conclusions
As long as the detected / passed encoding is right, and assuming the internal 'encoding' is able to represent all read characters (guaranteed with utf-8), there will be no data loss.
Likewise, as the original encoding is stored in 'fileencoding', writes of the file transparently convert back. Now, it could have happened that editing introduced a character that cannot be represented in the file's encoding (but you were able to edit it in because of Vim's internal Unicode encoding). Vim will then print E513: write error, conversion failed on writing, and you have to manually change the character(s), or choose a different target file encoding.
Example
A file with these Kanji characters 日本 is represented as follows in the SJIS encoding:
93fa 967b 0a
Each Kanji is stored in two bytes, and then you have the one-byte newline (LF) at the end.
With :set encoding=utf-8, this is represented internally as (g8 can tell you this):
e697 a5e6 9cac 0a
In UTF-8, each Kanji is stored in three bytes, the first Kanji is e6 97 a5.
Now if I edit the text, e.g. enclosing with (ASCII) parentheses, and :write, I get this:
2893 fa96 7b29 0a
The original SJIS encoding is restored, each Kanji is two bytes again, now with the added parentheses 28 and 29 around it.
Had I tried to edit in a ä character, the :write would have failed with the E513 error, as that character cannot be represented in SJIS.
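As an aside (not from the original answer), the byte sequences above are easy to verify with a couple of lines of Python:

text = "日本"
print(text.encode("shift_jis").hex())  # 93fa967b -- two bytes per Kanji, as in the SJIS file dump (plus its trailing 0a newline)
print(text.encode("utf-8").hex())      # e697a5e69cac -- three bytes per Kanji, Vim's internal utf-8 form
# "ä".encode("shift_jis") would raise UnicodeEncodeError: ä has no SJIS representation,
# which is the same situation behind Vim's E513 write error.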

How do I get from Ã‰phÃ©mÃ¨re to Éphémère in Python3?

I've tried all kinds of combinations of encode/decode with the 'surrogatepass' and 'surrogateescape' options, to no avail. I'm not sure what format this is in (it might even be a bug in AutoIt), but I know for a fact the information is in there, because at least one online UTF decoder got it right. On the online converter website, I specified the file as utf8 and the output as utf16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
Ã‰ in place of É
Ã© in place of é
Ã¨ in place of è
The following table summarises what's going on:
character    UTF-8 bytes    those bytes decoded as Windows-1252
É            C3 89          Ã‰
é            C3 A9          Ã©
è            C3 A8          Ã¨
All of É, é, and è are non-ASCII characters, and UTF-8 encodes each of them as a 2-byte sequence.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character Ã (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings in a Windows-1252 code page table.
So, that's why your decoding renders Ã‰ instead of É. The same goes for the other non-ASCII characters é and è, which come out as Ã© and Ã¨.
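A minimal Python sketch of that round trip (illustrative; the repair works here because Windows-1252 has a character for every byte involved, so nothing was lost in the mis-decoding):

s = "Éphémère"
garbled = s.encode("utf-8").decode("cp1252")      # UTF-8 bytes mis-decoded as Windows-1252
print(garbled)                                    # Ã‰phÃ©mÃ¨re -- the mojibake from the question
print(garbled.encode("cp1252").decode("utf-8"))   # Éphémère -- reverse the steps to repair it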
My issue was during the file reading. I solved it by specifying encoding='utf-8' in the options for open().
open(filePath, 'r', encoding='utf-8')

What encodings are available in Node.js

I'm trying to figure out what encodings are available in node.js.
The documentation (http://nodejs.org/api/buffer.html#buffer_new_buffer_str_encoding) says:
Allocates a new buffer containing the given str. encoding defaults to 'utf8'.
but nowhere does it specify a list of the available encodings. Maybe I missed it.
I'm working on a script which should be able to output in a wide range of encodings. So far I only know about utf8, as the docs say :)
Thx, Jaro.
Encodings available in Node.js:
ascii: For 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.
utf8: Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.
utf16le: 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.
ucs2: Alias of utf16le.
base64: Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC 4648, Section 5.
latin1: A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC 1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).
binary: Alias for 'latin1'.
hex: Encode each byte as two hexadecimal characters.
Source: Node 12 Buffer documentation
To clarify the latest Node.js encoding explanation, I'm pasting from the Node.js documentation. Note that Node v4+ does not deprecate 'binary'.
The character encodings currently supported by Node.js include:
'ascii' - For 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.
'utf8' - Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.
'utf16le' - 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.
'ucs2' - Alias of 'utf16le'.
'base64' - Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC 4648, Section 5.
'latin1' - A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC 1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).
'binary' - Alias for 'latin1'.
'hex' - Encode each byte as two hexadecimal characters.

How to substitute cp1250 specific characters to utf-8 in Vim

I have some Central European characters in cp1250 encoding in Vim. When I change the encoding with set encoding=utf-8 they appear like <d0> and such. How can I substitute, over the entire file, those characters with what they should be, i.e. Đ in this case?
As sidyll said, you should really use iconv for the purpose. Iconv knows stuff. It knows all the hairy encodings, obscure code points, katakana, denormalized and canonical forms, compositions, nonspacing characters and the rest.
:%!iconv --from-code cp1250 --to-code utf-8
or shorter
:%!iconv -f cp1250 -t utf-8
to filter the whole buffer. If you do
:he xxd
you'll get a sample of how to automatically encode on buffer load/save, if you want that.
iconv -l will list all the encodings it accepts/knows about (many: 1,168 on my system).
Happy hacking!
The iconv() function may be useful:
iconv({expr}, {from}, {to}) *iconv()*
The result is a String, which is the text {expr} converted
from encoding {from} to encoding {to}.
When the conversion fails an empty string is returned.
The encoding names are whatever the iconv() library function
can accept, see ":!man 3 iconv".
Most conversions require Vim to be compiled with the |+iconv|
feature. Otherwise only UTF-8 to latin1 conversion and back
can be done.
This can be used to display messages with special characters,
no matter what 'encoding' is set to. Write the message in
UTF-8 and use:
echo iconv(utf8_str, "utf-8", &enc)
Note that Vim uses UTF-8 for all Unicode encodings, conversion
from/to UCS-2 is automatically changed to use UTF-8. You
cannot use UCS-2 in a string anyway, because of the NUL bytes.
{only available when compiled with the +multi_byte feature}
You can set encoding to the value of your file's encoding and termencoding to UTF-8. See the Vim mbyte documentation.

How come VC++ 2010 uses char buffer rather than wchar_t buffer to represent basic_filebuf<wchar_t>?

It is very, very strange to me that the VC++ docs say (at http://msdn.microsoft.com/en-us/library/tzf8k3z8(VS.90).aspx):
"Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer. To store Unicode strings in the buffer, create a new buffer of type wchar_t and set it using the basic_streambuf::pubsetbuf() method. To see an example that demonstrates this behavior, see below."
Why?
This is only a guess but it may be this way to handle the common case (at least on Windows) where the program's internals are wchar_t (16-bit Unicode characters) but most/all text files it outputs are 8-bit ANSI.
Most text files still seem to be ANSI unless they really need to be otherwise, and many programs cannot cope properly with Unicode text files.
I wonder if it's really an ANSI string or a UTF-8 string that it converts to...
