Unicode <-> Multibyte conversion (native vs. managed)

I'm trying to convert Unicode strings coming from .NET to native C++ so that I can write them to a text file. The process is then reversed: the text is read back from the file and converted to a managed Unicode string.
I use the following code:
String^ FromNativeToDotNet(std::string value)
{
    // Convert an ASCII string to a Unicode String
    std::wstring wstrTo;
    wchar_t *wszTo = new wchar_t[value.length() + 1];
    wszTo[value.size()] = L'\0';
    MultiByteToWideChar(CP_UTF8, 0, value.c_str(), -1, wszTo, (int)value.length());
    wstrTo = wszTo;
    delete[] wszTo;
    return gcnew String(wstrTo.c_str());
}
std::string FromDotNetToNative(String^ value)
{
    // Pass on changes to native part
    pin_ptr<const wchar_t> wcValue = SafePtrToStringChars(value);
    std::wstring wsValue( wcValue );
    // Convert a Unicode string to an ASCII string
    std::string strTo;
    char *szTo = new char[wsValue.length() + 1];
    szTo[wsValue.size()] = '\0';
    WideCharToMultiByte(CP_UTF8, 0, wsValue.c_str(), -1, szTo, (int)wsValue.length(), NULL, NULL);
    strTo = szTo;
    delete[] szTo;
    return strTo;
}
What happens is that e.g. a Japanese character gets converted to two ASCII chars (漢 -> "w). I assume that's correct?
But the other way around does not work: when I call FromNativeToDotNet with "w, I only get "w back as a managed Unicode string...
How can I get the Japanese character correctly restored?

Best to use UTF8Encoding:
static String^ FromNativeToDotNet(std::string value)
{
    array<Byte>^ bytes = gcnew array<Byte>((int)value.length());
    System::Runtime::InteropServices::Marshal::Copy(IntPtr((void*)value.c_str()), bytes, 0, (int)value.length());
    return (gcnew System::Text::UTF8Encoding)->GetString(bytes);
}
static std::string FromDotNetToNative(String^ value)
{
    if (value->Length == 0) return std::string("");
    array<Byte>^ bytes = (gcnew System::Text::UTF8Encoding)->GetBytes(value);
    pin_ptr<Byte> chars = &bytes[0];
    return std::string((char*)chars, bytes->Length);
}
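A quick round trip shows the UTF-8 bytes surviving both directions (a minimal sketch of my own, assuming the two helpers above are in scope):

String^ original = L"\u6F22";                     // 漢
std::string utf8 = FromDotNetToNative(original);  // three bytes: E6 BC A2
String^ restored = FromNativeToDotNet(utf8);      // back to U+6F22
System::Console::WriteLine(String::Equals(original, restored)); // True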

a Japanese character gets converted to two ASCII chars (漢 -> "w). I assume that's correct?
No, that character, U+6F22, should be converted to three bytes: 0xE6 0xBC 0xA2
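(Worked out from the UTF-8 bit layout: 0x6F22 is 0110 1111 0010 0010 in binary, and packing those 16 bits into the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx gives 11100110 10111100 10100010, i.e. 0xE6 0xBC 0xA2.)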
In UTF-16 (little endian), U+6F22 is stored in memory as 0x22 0x6F, which would look like "o in ASCII (rather than "w), so it looks like something is wrong with your conversion from String^ to std::string.
I'm not familiar enough with String^ to know the right way to convert from it to a std::wstring, but I'm pretty sure that's where your problem is.
I don't think the following has anything to do with your problem, but it is obviously wrong:
std::string strTo;
char *szTo = new char[wsValue.length() + 1];
You already know that a single wide character can produce multiple narrow characters, so the number of narrow characters needed is not bounded by the number of wide characters, and a buffer of wsValue.length() + 1 chars can be too small.
You need to call WideCharToMultiByte once to calculate the required buffer size, and then call it again with a buffer of that size. Or you can simply allocate a buffer to hold 3 times as many chars as there are wide chars: in UTF-8 a character from the Basic Multilingual Plane takes at most 3 bytes, and a supplementary character takes 2 wide chars and 4 bytes.
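A sketch of that second option (my own illustration, reusing the wsValue variable from the code above and assuming it is non-empty):

// Worst case for UTF-8: each UTF-16 code unit expands to at most 3 bytes.
std::string strTo(3 * wsValue.length(), '\0');
int n = WideCharToMultiByte(CP_UTF8, 0, wsValue.c_str(), (int)wsValue.length(),
                            &strTo[0], (int)strTo.length(), NULL, NULL);
strTo.resize(n > 0 ? n : 0);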

Try this instead:
String^ FromNativeToDotNet(std::string value)
{
    // Convert a UTF-8 string to a UTF-16 String
    int len = MultiByteToWideChar(CP_UTF8, 0, value.c_str(), (int)value.length(), NULL, 0);
    if (len > 0)
    {
        std::vector<wchar_t> wszTo(len);
        MultiByteToWideChar(CP_UTF8, 0, value.c_str(), (int)value.length(), &wszTo[0], len);
        return gcnew String(&wszTo[0], 0, len);
    }
    return gcnew String((wchar_t*)NULL);
}
std::string FromDotNetToNative(String^ value)
{
    // Pass on changes to native part
    pin_ptr<const wchar_t> wcValue = SafePtrToStringChars(value);
    // Convert a UTF-16 string to a UTF-8 string
    int len = WideCharToMultiByte(CP_UTF8, 0, wcValue, value->Length, NULL, 0, NULL, NULL);
    if (len > 0)
    {
        std::vector<char> szTo(len);
        WideCharToMultiByte(CP_UTF8, 0, wcValue, value->Length, &szTo[0], len, NULL, NULL);
        return std::string(&szTo[0], len);
    }
    return std::string();
}
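(The two-call pattern above is the standard idiom for these APIs: the first MultiByteToWideChar / WideCharToMultiByte call with a null output buffer returns the exact number of elements required, and the second call performs the conversion into a buffer of exactly that size, so there is no need to guess at a worst-case buffer.)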

Related

Which delimiter can I safely use to separate zlib-deflated strings in node

I need to send content from a client to a remote server using node.js.
The content can be anything (a user can upload any file).
Each piece of content is compressed by zlib.deflate before sending it to the remote.
I would prefer not to make multiple round trips, and instead send the entire content at once.
To separate the pieces of content, I need a character that can't occur in the compressed data, so that I can split on it safely on the remote.
There is no such character or sequence of characters. zlib compressed data can contain any sequence of bytes.
You could encode the zlib compressed data to avoid one byte value, expanding compressed data slightly. Then you could use that one byte value as a delimiter.
Example code:
// Example of encoding binary data to a sequence of bytes with no zero values.
// The result is expanded slightly. On average, assuming random input, the
// expansion is less than 0.1%. The maximum expansion is less than 14.3%, which
// is reached only if the input is a sequence of bytes all with value 255.
#include <stdio.h>
// Encode binary data read from in, to a sequence of byte values in 1..255
// written to out. There will be no zero byte values in the output. The
// encoding is decoding a flat (equiprobable) Huffman code of 255 symbols.
void no_zeros_encode(FILE *in, FILE *out) {
    unsigned buf = 0;
    int bits = 0, ch;
    do {
        if (bits < 8) {
            ch = getc(in);
            if (ch != EOF) {
                buf += (unsigned)ch << bits;
                bits += 8;
            }
            else if (bits == 0)
                break;
        }
        if ((buf & 127) == 127) {
            putc(255, out);
            buf >>= 7;
            bits -= 7;
        }
        else {
            unsigned val = buf & 255;
            buf >>= 8;
            bits -= 8;
            if (val < 127)
                val++;
            putc(val, out);
        }
    } while (ch != EOF);
}
// Decode a sequence of byte values made by no_zeros_encode() read from in, to
// the original binary data written to out. The decoding is encoding a flat
// Huffman code of 255 symbols. no_zeros_encode() will not generate any zero
// byte values in its output (that's the whole point), but if there are any
// zeros in the input to no_zeros_decode(), they are ignored.
void no_zeros_decode(FILE *in, FILE *out) {
    unsigned buf = 0;
    int bits = 0, ch;
    while ((ch = getc(in)) != EOF)
        if (ch != 0) { // could flag any zeros as an error
            if (ch == 255) {
                buf += 127 << bits;
                bits += 7;
            }
            else {
                if (ch <= 127)
                    ch--;
                buf += (unsigned)ch << bits;
                bits += 8;
            }
            if (bits >= 8) {
                putc(buf, out);
                buf >>= 8;
                bits -= 8;
            }
        }
}
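For completeness, here is how the framing could look on the sending side (a minimal sketch of my own, assuming the two functions above; the file names are hypothetical stand-ins for already-deflated payloads). The receiver splits the stream on zero bytes, runs each piece through no_zeros_decode(), and then inflates it:

int main(void) {
    const char *blobs[] = { "blob1.z", "blob2.z" };  // hypothetical deflated payloads
    FILE *out = fopen("stream.bin", "wb");
    if (out == NULL) return 1;
    for (int i = 0; i < 2; i++) {
        FILE *in = fopen(blobs[i], "rb");
        if (in == NULL) return 1;
        no_zeros_encode(in, out);   // the encoded record contains no zero bytes
        fclose(in);
        putc(0, out);               // so a zero byte is a safe record delimiter
    }
    fclose(out);
    return 0;
}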

VC++: Converting Unicode Traditional Chinese characters to multi-byte does not always work

My application (MFC) is a Unicode app, and I have a third-party DLL which only takes multi-byte characters, so I have to convert the Unicode string to a multi-byte string to pass it to the third-party app. Korean, Japanese and even Simplified Chinese strings were converted correctly; Traditional Chinese was not. Below is my attempt. This CPP file is encoded in Unicode.
CString strFilePath(_T("中文字 深水埗.docx"));
wchar_t tcharPath[260];
wcscpy(tcharPath, (LPCTSTR)(strFilePath));
CString strAll = strFilePath;
int strAllLength = strAll.GetLength() + 1;
int nSize = 0;
char * pszBuf;
CPINFO pCPInfo;
BOOL bUsedDefaultChar = FALSE;
int nRC = GetCPInfo( CP_ACP, &pCPInfo );
if ((nSize = WideCharToMultiByte( CP_ACP, 0, strAll, strAllLength, NULL, 0, NULL, NULL )) > 0)
{   // get the required buffer length
    pszBuf = (char*)calloc(nSize + 1, sizeof(char)); // allocate the buffer
    if (pszBuf == NULL)
        return; // no more memory
    nRC = WideCharToMultiByte( CP_ACP, 0, strAll, strAll.GetLength(), pszBuf, nSize + 1, NULL, &bUsedDefaultChar ); // convert the Unicode chars into pszBuf
    DWORD dwErr = GetLastError();
    ::MessageBoxA( NULL, pszBuf, "", MB_OK );
    free(pszBuf); // free it
}
On Simplified Chinese Windows, the above 6 Chinese characters were displayed correctly. Unfortunately, on Traditional Chinese Windows, the 6th character "埗" couldn't be converted, so it came out as "?".
Can anyone explain why and tell me if it is possible to convert correctly?

CRijndael only encrypting first 32 bytes of longer string

I'm using CRijndael (http://www.codeproject.com/Articles/1380/A-C-Implementation-of-the-Rijndael-Encryption-Decr) for encryption with a null-based IV (I know that's an issue, but for certain reasons I'm stuck with having to use it).
For longer strings (or ones containing a few ampersands), only the first 32 bytes are ever encrypted. Shorter strings are encrypted without any issues. Code is below; any ideas?
char dataIn[] = "LONG STRING HERE";
string preInput = dataIn;
CRijndael aRijndael;
aRijndael.MakeKey("32-BIT-KEY-HERE", CRijndael::sm_chain0, 32, 16);
while (preInput.length() % 16 != 0) {
    preInput += '\0';
}
const char *encInput = preInput.c_str();
char szReq[1000];
aRijndael.Encrypt(preInput.c_str(), szReq, preInput.size(), CRijndael::CBC);
const std::string preBase64 = szReq;
std::string encoded = base64_encode(reinterpret_cast<const unsigned char*>(preBase64.c_str()), preBase64.length());

Show regional characters in MessageBox without setting required locale

I need to show a MessageBox with regional characters described in the ISO/IEC 8859-13 code page, without setting the Windows locale to this region. I was naive and tried to show the ASCII table with one-byte characters:
void showCodePage()
{
    char *a = new char[10000];
    char *aa = new char[10000];
    int f = 0;
    int lines = 0;
    for (int i = 1; i < 255; i++)
    {
        sprintf(a, "%c %0.3d ", i, i);
        sprintf(aa, "%c", i);
        f++;
        a += 6;
        aa++;
        if (f == 8)
        {
            f = 0;
            sprintf(a, "%c", 0x0d);
            a++;
            lines++;
        }
    }
    *aa = 0;
    *a = 0;
    a -= 254*6 + lines;
    aa -= 254;
    MessageBox(NULL, aa, "Hello!", MB_ICONEXCLAMATION | MB_OK);
    MessageBox(NULL, a, "Hello!", MB_ICONEXCLAMATION | MB_OK);
    delete [] a;
    delete [] aa;
}
OK, this doesn't show ISO/IEC 8859-13 correctly, and it's not possible without changing the locale.
So I decided to build a Unicode wstring instead. The function for converting single-byte chars to Unicode wchars:
wstring convert( const std::string& as )
{
    // deal with trivial case of empty string
    if( as.empty() ) return std::wstring();
    // determine required length of new string
    size_t reqLength = ::MultiByteToWideChar( CP_UTF8, 0, as.c_str(), (int)as.length(), 0, 0 );
    // construct new string of required length
    std::wstring ret( reqLength, L'\0' );
    // convert old string to new string
    ::MultiByteToWideChar( CP_UTF8, 0, as.c_str(), (int)as.length(), &ret[0], (int)ret.length() );
    // return new string ( compiler should optimize this away )
    return ret;
}
And changing the MessageBoxes:
MessageBoxW(NULL, convert(aa).c_str(), L"Hello!", MB_ICONEXCLAMATION | MB_OK);
MessageBoxW(NULL, convert(a).c_str() , L"Hello!", MB_ICONEXCLAMATION | MB_OK);
The result is still sad:
On the other hand, what was I expecting? I need to somehow tell the system which code page it should use to display my characters. How do I do that?
The problem with your solution is that the function MultiByteToWideChar with the CodePage parameter CP_UTF8 doesn't translate your specific single-byte code page to UTF-16; it translates UTF-8 to UTF-16, which is not what you need.
What you're looking for is a translation table from chars in ISO/IEC 8859-13 to wide chars. You can make one manually from the table in https://en.wikipedia.org/wiki/ISO/IEC_8859-13, e.g. 160 becomes 00A0, 161 becomes 201D, and so on.
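A minimal sketch of such a table-driven conversion (my illustration, not the answerer's code; only the two mappings quoted above are filled in, and the rest of the 0xA0..0xFF range must be completed from the Wikipedia table):

#include <string>

// Map one ISO/IEC 8859-13 byte to UTF-16. Values below 0xA0 are passed
// through unchanged in this sketch; the table for 0xA0..0xFF is only
// partially filled in (complete it from the Wikipedia page).
wchar_t fromLatin7(unsigned char c)
{
    static const wchar_t high[96] = {
        0x00A0, // 160 -> NO-BREAK SPACE
        0x201D, // 161 -> RIGHT DOUBLE QUOTATION MARK
        // ... entries for 162..255 go here ...
    };
    return c < 0xA0 ? (wchar_t)c : high[c - 0xA0];
}

std::wstring convertLatin7( const std::string& as )
{
    std::wstring ret;
    ret.reserve(as.length());
    for (std::string::size_type i = 0; i < as.length(); ++i)
        ret += fromLatin7((unsigned char)as[i]);
    return ret;
}

The result can then be passed to MessageBoxW exactly as in your convert() version.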

Converting Byte Array to String (NXC)

Is there a way to show a byte array on the NXT screen (using NXC)?
I've tried like this:
unsigned char Data[];
string Result = ByteArrayToStr(Data[0]);
TextOut(0, 0, Result);
But it gives me a File Error! -1.
If this isn't possible, how can I watch the value of Data[0] during the program?
If you want to show the byte array in hexadecimal format, you can do this:
byte buf[];
unsigned int buf_len = ArrayLen(buf);
string szOut = "";
string szTmp = "00";
// Convert to hexadecimal string.
for (unsigned int i = 0; i < buf_len; ++i)
{
    sprintf(szTmp, "%02X", buf[i]);
    szOut += szTmp;
}
// Display on screen.
WordWrapOut(szOut,
            0, 63,
            NULL, WORD_WRAP_WRAP_BY_CHAR,
            DRAW_OPT_CLEAR_WHOLE_SCREEN);
You can find WordWrapOut() here.
If you simply want to convert it to ASCII:
unsigned char Data[];
string Result = ByteArrayToStr(Data);
TextOut(0, 0, Result);
If you only wish to display one character:
unsigned char Data[];
string Result = FlattenVar(Data[0]);
TextOut(0, 0, Result);
Try byte. byte is an unsigned char in NXC.
P.S. There is a heavily-under-development debugger in BricxCC (I assume you're on Windows). Look here.
EDIT: The code compiles and runs, but does not do anything.
