VC++: Converting Unicode Traditional Chinese characters to multi-byte does not always work - visual-c++

My application (MFC) is a Unicode app, and I have a third-party DLL which only takes multi-byte characters, so I have to convert the Unicode string to a multi-byte string before passing it to the third-party DLL. Korean, Japanese and even Simplified Chinese strings were converted correctly, but Traditional Chinese was not. Below describes my attempt. This CPP file is saved as Unicode.
CString strFilePath(_T("中文字 深水埗.docx"));
wchar_t tcharPath[260];
wcscpy(tcharPath, (LPCTSTR)(strFilePath));
CString strAll = strFilePath;
int strAllLength = strAll.GetLength() + 1;
int nSize = 0;
char * pszBuf;
CPINFO pCPInfo;
BOOL bUsedDefaultChar = FALSE;
int nRC = GetCPInfo( CP_ACP, &pCPInfo );
if ((nSize = WideCharToMultiByte( CP_ACP, 0, strAll, strAllLength, NULL, 0, NULL, NULL )) > 0)
{ // get the required buffer size
    pszBuf = (char*)calloc(nSize + 1, sizeof(char)); // allocate the buffer
    if (pszBuf == NULL)
        return; // no more memory
    nRC = WideCharToMultiByte( CP_ACP, 0, strAll, strAll.GetLength(), pszBuf, nSize+1, NULL, &bUsedDefaultChar ); // convert the Unicode string into pszBuf
    DWORD dwErr = GetLastError();
    ::MessageBoxA( NULL, pszBuf, "", MB_OK );
    free(pszBuf); // free it
}
On Simplified Chinese Windows, the above 6 Chinese characters were displayed correctly. Unfortunately, on Traditional Chinese Windows the 6th character "埗" couldn't be converted, so it came out as "?".
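Incidentally, the bUsedDefaultChar flag that the code above already passes (but never checks) reports exactly this kind of lossy mapping; a minimal sketch along the lines of the code above (pszCheck and nBytes are illustrative names):
BOOL bLossy = FALSE;
int nBytes = WideCharToMultiByte( CP_ACP, 0, strAll, -1, NULL, 0, NULL, NULL );
if (nBytes > 0)
{
    char *pszCheck = (char*)calloc(nBytes, sizeof(char));
    if (pszCheck == NULL)
        return;
    WideCharToMultiByte( CP_ACP, 0, strAll, -1, pszCheck, nBytes, NULL, &bLossy );
    if (bLossy)
    {
        // at least one character (here the trailing 埗) had no mapping in the
        // active ANSI code page and was replaced by the default character "?"
    }
    free(pszCheck);
}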
Can anyone explain why and tell me if it is possible to convert correctly?

Related

Last character of a multibyte string

One of the things I often need to do when handling a multibyte string is deleting its last character. How do I locate this last character so I can chop it off using normal byte operations, preferably with as few reads as possible?
Note that this question is intended to work for most, if not all, multibyte encodings. The answer for self-synchronizing encodings like UTF-8 is trivial, as you can just scan right-to-left in the byte string for a start marker.
The answer will be written in C, with the POSIX multibyte functions. These functions are also found on Windows. Assume that the bytestring ends at len and is well-formed up to that point; assume appropriate setlocale calls. Porting to mbrlen is left as an exercise for the reader.
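(For UTF-8 alone, that trivial right-to-left scan might look like the following sketch; it is not part of the answer below, which targets arbitrary locale encodings.)
// Sketch for self-synchronizing UTF-8 only: continuation bytes match 10xxxxxx.
ssize_t index_of_last_char_utf8(const char *c, size_t len) {
    if (len == 0) return -1;
    size_t pos = len - 1;
    while (pos > 0 && ((unsigned char)c[pos] & 0xC0) == 0x80)
        pos--;               // skip continuation bytes
    return (ssize_t)pos;     // first byte of the last character
}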
The naive solution
The obviously correct solution involves parsing the encoding "as intended", going from left-to-right.
ssize_t index_of_last_char_left(const char *c, size_t len) {
    size_t pos = 0;
    size_t next = 1;
    mblen(NULL, 0);
    while (pos < len - 1) {
        next = mblen(c + pos, len - pos);
        if (next == -1) // Invalid input
            return pos;
        pos += next;
    }
    return pos - next;
}
Deleting multiple characters like this will cause an "accidentally quadratic" situation; memoizing intermediate positions helps, but requires additional bookkeeping.
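(For example, chopping the last character off a byte string, using the function above, comes down to truncating at the returned index; buf and len here stand for whatever byte string you are holding:)
ssize_t start = index_of_last_char_left(buf, len);
if (start >= 0)
    len = (size_t)start; // the bytes [start, len) held the last character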
The right-to-left solution
As I mentioned in the question, for self-synchronizing encodings the only thing to do is to look for a start marker. But what breaks with the ones that don't self-synchronize?
The one-or-two-byte EUC encodings have both bytes of a two-byte sequence above 0x7f, and there is almost no way to tell start bytes from continuation bytes. For those we can check for mblen(pos) == bytes_left, since we know the string is well-formed.
The Big5, GBK, and GB18030 encodings also allow a continuation byte in the ASCII range, so a lookbehind is mandatory.
With that cleared out (and assuming the bytestring up to len is well-formed), we can have:
// As much as CJK encodings do. I don't have time to see if it works for UTF-1.
#define MAX_MB_LEN 4
ssize_t index_of_last_char_right(const char *c, size_t len) {
    ssize_t pos = len - 1;
    bool last = true;
    bool last_is_okay = false;
    assert(!mblen(NULL, 0)); // No, we really cannot handle shift states.
    for (; pos >= 0 && pos >= len - 2 - MAX_MB_LEN; pos--) {
        int next = mblen(c + pos, len - pos);
        bool okay = (next > 0) && (next == len - pos - 1);
        if (last) {
            last_is_okay = okay;
            last = false;
        } else if (okay)
            return pos;
    }
    return last_is_okay ? len - 1 : -1;
}
(You should be able to find the last good char of a malformed string by (next > 0) && (next <= len - pos - 1). But don't return that when the last byte is okay!)
What's the point of this?
The code sample above is for the idealist who does not want to write just "UTF-8 support" but "locale support" based on the C library. There might not be a point to this at all in 2021 :)

How to convert saved text file encoding to UTF8?

Recently I saved a text file on my computer, but when I opened it again I saw strings like:
"˜ÌÇí ÍÑÝã ÚÌíÈå¿"
Now I want to know: is it possible to convert it back to the original text (UTF-8)?
I tried this code but it doesn't work:
string tempStr="˜ÌÇí ÍÑÝã ÚÌíÈå¿";
Encoding ANSI = Encoding.GetEncoding(1256);
byte[] ansiBytes = ANSI.GetBytes(tempStr);
byte[] utf8Bytes = Encoding.Convert(ANSI, Encoding.UTF8, ansiBytes);
String utf8String = Encoding.UTF8.GetString(utf8Bytes);
You can use something like:
string str = Encoding.GetEncoding(1256).GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(tempStr))
The string wasn't really decoded... Its bytes were simply "enlarged" to char, with something like:
byte[] bytes = ...
char[] chars = new char[bytes.Length];
for (int i = 0; i < bytes.Length; i++)
{
    chars[i] = bytes[i];
}
string str = new string(chars);
Now... this transformation is the same one done by the code page ISO-8859-1. So I could simply have done the reverse myself, or I could have used that code page to do it for me; I chose the second option.
Encoding.GetEncoding("iso-8859-1").GetBytes(tempStr)
This gave me the original byte[].
Then I did some tests, and it seems the text wasn't originally UTF-8; it was in code page 1256, which is an Arabic code page. So I used:
string str = Encoding.GetEncoding(1256).GetString(...);
The only oddity is that the ˜ doesn't seem to be part of the original string.
There is another possibility:
string str = Encoding.GetEncoding(1256).GetString(Encoding.GetEncoding(1252).GetBytes(tempStr));
Code page 1252 is the code page used in the USA and in a large part of Europe. If your Windows is configured for English, there is a good chance it uses 1252 as the default code page. The result is slightly different from using iso-8859-1.

CRijndael only encrypting first 32 bytes of longer string

I'm using CRijndael ( http://www.codeproject.com/Articles/1380/A-C-Implementation-of-the-Rijndael-Encryption-Decr ) for encryption with a null-based IV (I know that's an issue, but for certain reasons I'm stuck with having to use it).
For strings that are longer (or contain a few ampersands) I only ever get the first 32 bytes encrypted. Shorter strings are encrypted without any issues. Code is below; any ideas?
char dataIn[] = "LONG STRING HERE";
string preInput = dataIn;
CRijndael aRijndael;
aRijndael.MakeKey("32-BIT-KEY-HERE", CRijndael::sm_chain0, 32, 16);
while (preInput.length() % 16 != 0) {
    preInput += '\0';
}
const char *encInput = preInput.c_str();
char szReq[1000];
aRijndael.Encrypt(preInput.c_str(), szReq, preInput.size(), CRijndael::CBC);
const std::string preBase64 = szReq;
std::string encoded = base64_encode(reinterpret_cast<const unsigned char*>(preBase64.c_str()), preBase64.length());
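One thing to keep in mind with this code: the ciphertext in szReq is binary and may contain NUL bytes, and constructing the intermediate std::string from it as a C string stops at the first NUL. A length-aware construction (a sketch, reusing the padded length already computed above) keeps the whole buffer:
const std::string preBase64(szReq, preInput.size()); // keep all ciphertext bytes, NULs included
std::string encoded = base64_encode(reinterpret_cast<const unsigned char*>(preBase64.data()), preBase64.length());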

Show regional characters in MessageBox without setting required locale

I need to show a MessageBox with regional characters described in the ISO/IEC 8859-13 code page without setting the Windows locale to that region. I was naive and tried to show the ASCII table with one-byte characters:
void showCodePage()
{
    char *a = new char[10000];
    char *aa = new char[10000];
    int f = 0;
    int lines = 0;
    for (int i = 1; i < 255; i++)
    {
        sprintf(a, "%c %0.3d ", i, i);
        sprintf(aa, "%c", i);
        f++;
        a += 6;
        aa++;
        if (f == 8)
        {
            f = 0;
            sprintf(a, "%c", 0x0d);
            a++;
            lines++;
        }
    }
    *aa = 0;
    *a = 0;
    a -= 254*6 + lines;
    aa -= 254;
    MessageBox(NULL, aa, "Hello!", MB_ICONEXCLAMATION | MB_OK);
    MessageBox(NULL, a, "Hello!", MB_ICONEXCLAMATION | MB_OK);
    delete [] a;
    delete [] aa;
}
OK, this doesn't show ISO/IEC 8859-13 correctly, and apparently that is not possible without changing the locale:
Now I decided to build a Unicode wstring instead. The function converting from single-byte chars to Unicode wchar_t:
wstring convert( const std::string& as )
{
    // deal with trivial case of empty string
    if( as.empty() ) return std::wstring();
    // determine required length of new string
    size_t reqLength = ::MultiByteToWideChar( CP_UTF8, 0, as.c_str(), (int)as.length(), 0, 0 );
    // construct new string of required length
    std::wstring ret( reqLength, L'\0' );
    // convert old string to new string
    ::MultiByteToWideChar( CP_UTF8, 0, as.c_str(), (int)as.length(), &ret[0], (int)ret.length() );
    // return new string ( compiler should optimize this away )
    return ret;
}
And changing the MessageBoxes:
MessageBoxW(NULL, convert(aa).c_str(), L"Hello!", MB_ICONEXCLAMATION | MB_OK);
MessageBoxW(NULL, convert(a).c_str() , L"Hello!", MB_ICONEXCLAMATION | MB_OK);
Result is still sad:
On the other hand, what was I expecting? I somehow need to tell the system which code page it should use to display my characters. How do I do that?
The problem with your solution is that MultiByteToWideChar with the CodePage parameter CP_UTF8 doesn't translate your specific single-byte code page to UTF-16; it translates UTF-8 to UTF-16, which is not what you need.
What you're looking for is a translation table from chars in ISO/IEC 8859-13 to wide chars. You can make one manually from the table at https://en.wikipedia.org/wiki/ISO/IEC_8859-13, e.g. 160 (0xA0) becomes U+00A0, 161 (0xA1) becomes U+201D, and so on.
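A minimal sketch of that approach (only the two mappings mentioned above are filled in; the remaining 0xA2-0xFF entries come from the Wikipedia table, and the function name is just illustrative):
std::wstring convert8859_13( const std::string& as )
{
    wchar_t table[256];
    for (int i = 0; i < 256; i++)
        table[i] = (wchar_t)i;        // 0x00-0x9F map to the same code points
    table[0xA0] = 0x00A0;             // NO-BREAK SPACE
    table[0xA1] = 0x201D;             // RIGHT DOUBLE QUOTATION MARK
    // ... fill the remaining 0xA2-0xFF entries from the ISO/IEC 8859-13 chart ...
    std::wstring ret;
    for (unsigned char ch : as)
        ret += table[ch];
    return ret;
}
MessageBoxW(NULL, convert8859_13(a).c_str(), L"Hello!", MB_ICONEXCLAMATION | MB_OK);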

Unicode <-> Multibyte conversion (native vs. managed)

I'm trying to convert Unicode strings coming from .NET to native C++ so that I can write them to a text file. The process is then reversed, so that the text from the file is read and converted back to a managed Unicode string.
I use the following code:
String^ FromNativeToDotNet(std::string value)
{
    // Convert an ASCII string to a Unicode String
    std::wstring wstrTo;
    wchar_t *wszTo = new wchar_t[value.length() + 1];
    wszTo[value.size()] = L'\0';
    MultiByteToWideChar(CP_UTF8, 0, value.c_str(), -1, wszTo, (int)value.length());
    wstrTo = wszTo;
    delete[] wszTo;
    return gcnew String(wstrTo.c_str());
}
std::string FromDotNetToNative(String^ value)
{
    // Pass on changes to native part
    pin_ptr<const wchar_t> wcValue = SafePtrToStringChars(value);
    std::wstring wsValue( wcValue );
    // Convert a Unicode string to an ASCII string
    std::string strTo;
    char *szTo = new char[wsValue.length() + 1];
    szTo[wsValue.size()] = '\0';
    WideCharToMultiByte(CP_UTF8, 0, wsValue.c_str(), -1, szTo, (int)wsValue.length(), NULL, NULL);
    strTo = szTo;
    delete[] szTo;
    return strTo;
}
What happens is that e.g. a Japanese character gets converted to two ASCII chars (漢 -> "w). I assume that's correct?
But the other way does not work: when I call FromNativeToDotNet with "w I only get "w back as a managed Unicode string...
How can I get the Japanese character correctly restored?
Best to use UTF8Encoding:
static String^ FromNativeToDotNet(std::string value)
{
    array<Byte>^ bytes = gcnew array<Byte>(value.length());
    System::Runtime::InteropServices::Marshal::Copy(IntPtr((void*)value.c_str()), bytes, 0, value.length());
    return (gcnew System::Text::UTF8Encoding)->GetString(bytes);
}
static std::string FromDotNetToNative(String^ value)
{
    if (value->Length == 0) return std::string("");
    array<Byte>^ bytes = (gcnew System::Text::UTF8Encoding)->GetBytes(value);
    pin_ptr<Byte> chars = &bytes[0];
    return std::string((char*)chars, bytes->Length);
}
a Japanese character gets converted to two ASCII chars (漢 -> "w). I assume that's correct?
No, that character, U+6F22, should be converted to three bytes: 0xE6 0xBC 0xA2
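(As a quick check of the arithmetic: U+6F22 is 0110 1111 0010 0010 in binary; packed into the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx this gives 11100110 10111100 10100010, i.e. 0xE6 0xBC 0xA2.)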
In UTF-16 (little endian) U+6F22 is stored in memory as 0x22 0x6F, which would look like "o in ASCII (rather than "w), so it looks like something is wrong with your conversion from String^ to std::string.
I'm not familiar enough with String^ to know the right way to convert from String^ to std::wstring, but I'm pretty sure that's where your problem is.
I don't think the following has anything to do with your problem, but it is obviously wrong:
std::string strTo;
char *szTo = new char[wsValue.length() + 1];
You already know that a single wide character can produce multiple narrow characters, so the number of narrow characters can exceed the number of wide characters; a buffer sized from the wide-character count is not necessarily big enough.
You need to use WideCharToMultiByte to calculate the buffer size, and then call it again with a buffer of that size. Or you can just allocate a buffer to hold 3 times the number of chars as wide chars.
Try this instead:
String^ FromNativeToDotNet(std::string value)
{
    // Convert a UTF-8 string to a UTF-16 String
    int len = MultiByteToWideChar(CP_UTF8, 0, value.c_str(), value.length(), NULL, 0);
    if (len > 0)
    {
        std::vector<wchar_t> wszTo(len);
        MultiByteToWideChar(CP_UTF8, 0, value.c_str(), value.length(), &wszTo[0], len);
        return gcnew String(&wszTo[0], 0, len);
    }
    return gcnew String((wchar_t*)NULL);
}
std::string FromDotNetToNative(String^ value)
{
    // Pass on changes to native part
    pin_ptr<const wchar_t> wcValue = SafePtrToStringChars(value);
    // Convert a UTF-16 string to a UTF-8 string
    int len = WideCharToMultiByte(CP_UTF8, 0, wcValue, value->Length, NULL, 0, NULL, NULL);
    if (len > 0)
    {
        std::vector<char> szTo(len);
        WideCharToMultiByte(CP_UTF8, 0, wcValue, value->Length, &szTo[0], len, NULL, NULL);
        return std::string(&szTo[0], len);
    }
    return std::string();
}
