Converting wchar_t* to char* on iOS

I'm attempting to convert a wchar_t* to a char*. Here's my code:
size_t result = wcstombs(returned, str, length + 1);
if (result == (size_t)-1) {
    int error = errno;
}
It indeed fails, and error is filled with 92 (ENOPROTOOPT) - Protocol not available.
I've even tried setting the locale:
setlocale(LC_ALL, "C");
And this one too:
setlocale(LC_ALL, "");
I'm tempted to just throw the characters with static casts!

It turned out that the source string used a non-standard encoding (two ASCII characters for each wide character), which looked fine in the debugger but was internally malformed. The error code simply means the text could not be decoded; on iOS, errno 92 is actually EILSEQ ("Illegal byte sequence") rather than ENOPROTOOPT.
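For reference, here is a minimal sketch (not from the original post) of the calling pattern that works once the input really is well-formed wide text. The variable names mirror the question's code, and the "en_US.UTF-8" locale name is an assumption that may need adjusting for the target platform:

#include <cerrno>
#include <clocale>
#include <cstdio>
#include <cstdlib>   // wcstombs

int main()
{
    // Assumption: the platform accepts a UTF-8 locale name here;
    // the plain "C" locale cannot encode characters outside ASCII.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const wchar_t* str = L"héllo";   // a well-formed wide string
    char returned[64];

    std::size_t result = std::wcstombs(returned, str, sizeof returned);
    if (result == (std::size_t)-1) {
        std::printf("conversion failed, errno = %d\n", errno);
    } else {
        std::printf("converted %zu bytes: %s\n", result, returned);
    }
    return 0;
}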

Related

CString to UTF8 conversion fails for "ý"

In my application I want to convert a string that contains the character ý to UTF-8, but it's not giving the expected result.
I am using the WideCharToMultiByte function, and it converts that particular character to "Ã½".
For example:
Input - "ý"
Output - "Ã½"
Please see the code below.
CString strBuffer("ý");
char *utf8Buffer = (char*)malloc(strBuffer.GetLength() + 1);
int utf8bufferLength = WideCharToMultiByte(CP_UTF8, 0,
    (LPCWSTR)strBuffer.GetBuffer(strBuffer.GetLength() + 1),
    strBuffer.GetLength(), utf8Buffer, strBuffer.GetLength() * 4, 0, 0);
Please give your suggestions...
Binoy Krishna
The Unicode code point for the letter ý, according to this page, is 253 in decimal or FD in hexadecimal. Its UTF-8 representation is 195 189 in decimal, or C3 BD in hexadecimal. Those two bytes can be displayed as the letters "Ã½" by your program and/or debugger, but they are UTF-8 code units, so they are bytes, not letters.
In other words, the output and the code are fine, and your expectations are wrong. I can't say why they are wrong because you haven't mentioned what exactly you were expecting.
EDIT: The code should be improved. See Rudolfs' answer for more info.
While I was writing this, an answer explaining the character values you are seeing was already posted; however, there are two things to mention about your code:
1) You should use the _T() macro when initializing the string: CString strBuffer(_T("ý")); The _T() macro is defined in tchar.h and maps to the correct string type depending on the value of the _UNICODE macro.
2) Do not use GetLength() to calculate the size of the UTF-8 buffer; see the documentation of WideCharToMultiByte on MSDN, which shows how to use the function to calculate the needed length for the UTF-8 buffer (in the comments section).
Here is a small example that verifies the output according to the codepoints and demonstrates how to use the automatic length calculation:
#define _AFXDLL
#include <afx.h>
#include <iostream>
int main(int argc, char** argv)
{
    CString wideStrBuffer(_T("ý"));
    // The length calculation assumes wideStrBuffer is zero terminated
    CStringA utf8Buffer('\0', WideCharToMultiByte(CP_UTF8, 0, wideStrBuffer.GetBuffer(), -1, NULL, 0, NULL, NULL));
    WideCharToMultiByte(CP_UTF8, 0, wideStrBuffer.GetBuffer(), -1, utf8Buffer.GetBuffer(), utf8Buffer.GetLength(), NULL, NULL);
    if (static_cast<unsigned char>(utf8Buffer[0]) == 195 && static_cast<unsigned char>(utf8Buffer[1]) == 189)
    {
        std::cout << "Conversion successful!" << std::endl;
    }
    return 0;
}

How to know if wstring can be safely (no data loss) converted to string?

So I already know how to convert wstring to string (How to convert wstring into string?).
However, I would like to know whether it is safe to make the conversion, meaning that the wstring variable does not contain any characters that are not supported by the string type.
strings can hold any data, if you use the right encoding. They are just sequences of bytes. But you need to check with your particular encoding / conversion routine.
Should be simply a matter of round-tripping. An elegant solution to many things.
Warning, Pseudo-code, there is no literal convert_to_wstring() unless you make it so:
if(convert_to_wstring(convert_to_string(ws)) == ws)
    happy_days();
If what goes in comes out, it is non-lossy (at least for your code points).
Not that it's the most efficient solution, but it should let you build on your favorite conversion routines.
// Round-trip and see if we lose anything
bool check_ws2s(const std::wstring& wstr)
{
    return (s2ws(ws2s(wstr)) == wstr);
}
Using @dk123's conversions for C++11 at How to convert wstring into string? (upvote his answer here: https://stackoverflow.com/a/18374698/257090)
// Requires <locale>, <codecvt>, and <string>
std::wstring s2ws(const std::string& str)
{
    typedef std::codecvt_utf8<wchar_t> convert_typeX;
    std::wstring_convert<convert_typeX, wchar_t> converterX;
    return converterX.from_bytes(str);
}

std::string ws2s(const std::wstring& wstr)
{
    typedef std::codecvt_utf8<wchar_t> convert_typeX;
    std::wstring_convert<convert_typeX, wchar_t> converterX;
    return converterX.to_bytes(wstr);
}
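For quick experimentation, here is a compact, self-contained variant of the same round-trip test (my own sketch; it assumes a C++11 compiler, and note that std::codecvt_utf8 is deprecated since C++17 though still available):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

// Same idea as check_ws2s above: convert to UTF-8 and back, then compare.
// Note that wstring_convert throws std::range_error on input it cannot
// convert, so a production version would catch that and return false.
static bool survives_round_trip(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
    return conv.from_bytes(conv.to_bytes(ws)) == ws;
}

int main()
{
    std::cout << std::boolalpha
              << survives_round_trip(L"plain ASCII") << '\n'
              << survives_round_trip(L"accented \u00fd") << '\n';
    return 0;
}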
Note, if your idea of conversion is truncating the wide chars to chars, then it is simply a matter of iterating and checking that each wide char value fits in a char. This will probably do it.
WARNING: Not appropriate for multibyte encoding.
for(wchar_t& wc: ws) {
    if(wc > static_cast<char>(wc))
        return false;
}
return true;
Or:
// Could use a narrowing cast comparison, but this avoids any warnings
for(wchar_t& wc: ws) {
    if(wc > std::numeric_limits<char>::max())
        return false;
}
return true;
FWIW, in Win32, there are conversion routines that accept a parameter of WC_ERR_INVALID_CHARS that tells the routine to fail instead of silently dropping code points. Non-standard solutions, of course.
Example: WideCharToMultiByte()
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374130(v=vs.85).aspx
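Here is a hedged Win32 sketch of that idea (the helper name is mine, not from the documentation): ask WideCharToMultiByte for the required UTF-8 length with WC_ERR_INVALID_CHARS and treat failure as "not safely convertible".

#include <windows.h>
#include <string>

// Returns true if wstr can be encoded as UTF-8 without invalid sequences.
// WC_ERR_INVALID_CHARS makes the call fail (ERROR_NO_UNICODE_TRANSLATION)
// on input such as lone surrogates instead of silently substituting.
bool convertsCleanlyToUtf8(const std::wstring& wstr)
{
    if (wstr.empty())
        return true;
    int len = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                  wstr.c_str(), static_cast<int>(wstr.size()),
                                  NULL, 0, NULL, NULL);
    return len != 0;
}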

Why am I getting gibberish output, along with valid output, when reading a String^ ?

I'm trying to write a few integers to a file (as a string). Every time I try to run this bit of code I get the integers into the text file as planned, but before the integers I get some gibberish. I did some experimenting and found that even if I put nothing into System::String ^ b, it would write the same gibberish to the file or a message box, but I couldn't figure out why it would do this when I was concatenating those integers to it (as strings). What could be going wrong here?
using namespace msclr::interop;
using namespace System;
using namespace System::IO;
using namespace System::Text;
...
System::IO::StreamWriter ^ x;
char buffer[21], buffer2[3];
int a;
for(a = 0; a < 10; a++){
    itoa(weight[a], buffer, 10);
    strcat(buffer, buffer2);
}
System::String ^ b = marshal_as<String^>(buffer);
x->WriteLine(b);
What format is the file in? You may be reading a UTF-8 file with a byte-order mark that is silently applied by a text editing program.
http://en.wikipedia.org/wiki/Byte_order_mark
Typo in question or bug in code: pass buffer2 to itoa instead of buffer.
Also, initialize buffer to "";
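Putting both of those fixes together, here is a small native-C++ sketch of what the loop is presumably meant to do. It uses snprintf because itoa is non-standard, and the weight values are made up for illustration:

#include <cstdio>
#include <cstring>

int main()
{
    int weight[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};   // placeholder data

    char buffer[64] = "";   // initialized, so strcat has a terminator to find
    char buffer2[16];

    for (int a = 0; a < 10; a++) {
        std::snprintf(buffer2, sizeof buffer2, "%d ", weight[a]);  // convert one integer
        std::strcat(buffer, buffer2);                              // append its digits
    }

    std::puts(buffer);   // in the question, this string would go to the StreamWriter
    return 0;
}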

C++ char* returned through SWIG causes a problem in Python 3.0

Our C++ library works fine with Python 2.4 using SWIG, returning a C++ char* back to a Python str. But this solution hits a problem in Python 3.0; the error is:
Exception=(UnicodeDecodeError('utf8', b"\xb6\x9d\xa.....", 0, 1, 'unexpected code byte'))
Our definition looks like this (it works fine in Python 2.4):
void cGetPubModulus(
    void* pSslRsa,
    char* cMod,
    int* nLen );

%include "cstring.i"
%cstring_output_withsize( char* cMod, int* nLen );
I suspect SWIG is doing a bytes-to-str conversion automatically. In Python 2.4 it can be implicit, but in Python 3.0 it's no longer allowed. Anyone got a good idea? Thanks.
It's rather Python 3 that does that conversion. In Python 2, bytes and str are the same thing; in Python 3, str is Unicode, so something somewhere tries to decode your data as UTF-8, but it isn't valid UTF-8.
Your Python 3 code needs to return not a Python str, but a Python bytes. This will not work with Python 2, though, so you need preprocessor statements to handle the differences.
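One way to do that on the SWIG side, sketched here as an untested assumption rather than a recipe, is an argout typemap that wraps the output buffer in a Python bytes object so nothing ever tries to decode it as text:

/* Hedged sketch: hand the buffer back as bytes so Python 3 never tries to
 * decode it as UTF-8 text. cMod/nLen are the question's output parameters. */
%typemap(argout) (char* cMod, int* nLen) {
    %append_output(PyBytes_FromStringAndSize($1, *$2));
}

This only covers the output side; the buffer-allocating "in" typemaps that %cstring_output_withsize generates would still be needed, so treat it as a starting point.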
I came across a similar problem. I wrote a SWIG typemap for a custom char array (an unsigned char, in fact) and it got a SEGFAULT when using Python 3. So I debugged the code within the typemap and realized the problem Lennart states.
My solution to that problem was doing the following in that typemap:
%typemap(in) byte_t[MAX_FONTFACE_LEN] {
    if (PyString_Check($input))
    {
        $1 = (byte_t *)PyString_AsString($input);
    }
    else if (PyUnicode_Check($input))
    {
        $1 = (byte_t *)PyUnicode_AsEncodedString($input, "utf-8", "Error ~");
        $1 = (byte_t *)PyBytes_AS_STRING($1);
    }
    else
    {
        PyErr_SetString(PyExc_TypeError, "Expected a string.");
        return NULL;
    }
}
That is, I check what kind of string object the PyObject is. The macros PyString_Check() and PyUnicode_Check() return non-zero if the input is a byte string or a Unicode string, respectively. If it is a Unicode string, we encode it to bytes with PyUnicode_AsEncodedString() and then obtain a char * from those bytes with PyBytes_AS_STRING().
Note that I reuse the same variable for storing the Unicode string and later converting it to bytes. Although that is questionable and could start another coding-style discussion, the fact is that it solved my problem. I have tested it with Python 3 and Python 2.7 binaries without any problems so far.
Lastly, the final branch raises an exception back to the Python caller to report that the input was neither a byte string nor a Unicode string.

LPCSTR, TCHAR, String

I am using the following string types:
LPCSTR, TCHAR, String. I want to convert:
from TCHAR to LPCSTR
from String to char
I convert from TCHAR to LPCSTR with this code:
RunPath = TEXT("C:\\1");
LPCSTR Path = (LPCSTR)RunPath;
From String to char I convert with this code:
SaveFileDialog^ saveFileDialog1 = gcnew SaveFileDialog;
saveFileDialog1->Title = "Сохранение файла-настроек";
saveFileDialog1->Filter = "bck files (*.bck)|*.bck";
saveFileDialog1->RestoreDirectory = true;
pin_ptr<const wchar_t> wch = TEXT("");
if ( saveFileDialog1->ShowDialog() == System::Windows::Forms::DialogResult::OK ) {
    wch = PtrToStringChars(saveFileDialog1->FileName);
} else return;
ofstream os(wch, ios::binary);
My problem is that when I set Configuration Properties -> General -> Character Set to "Use Multi-Byte Character Set", the first part of the code works correctly, but the second part returns error C2440. When I set Configuration Properties -> General -> Character Set to "Use Unicode Character Set", the second part of the code works correctly, but the first part returns only the first character when the TCHAR string is cast to LPCSTR.
I'd suggest you need to be using Unicode the whole way through.
LPCSTR is a "Long Pointer to a Constant String", i.e. a plain char-based C string. That's typically not what you want when you're dealing with .NET methods; the char type in .NET is 16 bits wide.
You also should not use the TEXT("") macro unless you're planning multiple builds using various character encodings. Try wide string literals (the L"" prefix) for all your string constants instead, and a pure Unicode build if you can.
See if that helps.
PS. std::wstring is very handy in your scenario.
EDIT
You see only one character because the string is now Unicode but you cast it as a regular string. Many or most of the Unicode characters in the ASCII range have the same numeric value as in ASCII but have the second of their two bytes set to zero. So when a Unicode string is read as a C string, you only see the first character, because C strings are null (zero) terminated. The easy (and wrong) way to deal with this is to use std::wstring, convert it to a std::string by truncation, and then pull the C string out of that. This is not a safe approach, because Unicode has a much larger character space than your narrow encoding.
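To convert rather than cast in a Unicode build, here is a hedged sketch (the helper name is mine) using WideCharToMultiByte, which produces a genuine narrow string instead of reinterpreting the wide one:

#include <windows.h>
#include <string>

// Convert a wide (UTF-16) string to a narrow UTF-8 std::string.
// Casting a TCHAR*/wchar_t* to LPCSTR only reinterprets the bytes, which is
// why the Unicode build showed just the first character.
std::string narrow(const wchar_t* wide)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
    if (len <= 0)
        return std::string();
    std::string out(len, '\0');   // len includes the terminating NUL
    WideCharToMultiByte(CP_UTF8, 0, wide, -1, &out[0], len, NULL, NULL);
    out.resize(len - 1);          // drop the embedded terminator
    return out;
}

int main()
{
    const wchar_t* RunPath = L"C:\\1";   // what TEXT("C:\\1") expands to in a Unicode build
    std::string Path = narrow(RunPath);  // Path.c_str() is usable wherever an LPCSTR is expected
    return 0;
}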
