CString to UTF8 conversion fails for "ý"

CString to UTF8 conversion fails for "ý" - visual-c++

In my application I want to convert a string that contains character ý, to UTF-8. But its not giving the exact result.
I am using WideCharToMultiByte function, it is converting the purticular character to Ã½.
For Example :
Input - "ý"
Output - "Ã½"
Please see the code below..
String strBuffer("ý" );
char *utf8Buffer = (char*)malloc(strBuffer.GetLength()+1);
int utf8bufferLength = WideCharToMultiByte(CP_UTF8, 0, (LPCWSTR)strBuffer.GetBuffer(strBuffer.GetLength() + 1)),
strBuffer.GetLength(), utf8Buffer, strBuffer.GetLength() * 4,0,0);
Please give your suggestions...
Binoy Krishna

Unicode codepoint for letter ý, according to this page is 25310 or FD16. UTF-8 representation is 195 189 decimal or C3 BD hexadecimal. These two bytes can be seen as letters Ã½ in your program and/or debugger, but they are UTF-8 numbers, so they are bytes, not letters.
In another words the output and the code are fine, and your expectations are wrong. I can't say why are they wrong because you haven't mentioned what exactly were you expecting.
EDIT: The code should be improved. See Rudolfs' answer for more info.

While I was writing this an answer explaining the character values you are seeing was already posted, however, there are two things to mention about your code:
1) you should use the _T() macro when initializing the string: CString strBuffer(_T("ý")); The _T() macro is defined in tchar.h and maps to the correct string type depending on the value of the _UNICODE macro.
2) do not use the GetLength() to calculate the size of the UTF-8 buffer, see the documentation of WideCharToMultiByte in MSDN, it shows how to use the function to calculate the needed length for the UTF-8 buffer in the comments section.
Here is a small example that verifies the output according to the codepoints and demonstrates how to use the automatic length calculation:
#define _AFXDLL
#include <afx.h>
#include <iostream>
int main(int argc, char** argv)
{
CString wideStrBuffer(_T("ý"));
// The length calculation assumes wideStrBuffer is zero terminated
CStringA utf8Buffer('\0', WideCharToMultiByte(CP_UTF8, 0, wideStrBuffer.GetBuffer(), -1, NULL, 0, NULL, NULL));
WideCharToMultiByte(CP_UTF8, 0, wideStrBuffer.GetBuffer(), -1, utf8Buffer.GetBuffer(), utf8Buffer.GetLength(), NULL, NULL);
if (static_cast<unsigned char>(utf8Buffer[0]) == 195 && static_cast<unsigned char>(utf8Buffer[1]) == 189)
{
std::cout << "Conversion successful!" << std::endl;
}
return 0;
}

Related

C Function to return a String resulting in corrupted top size

I am trying to write a program that calls upon an [external library (?)] (I'm not sure that I'm using the right terminology here) that I am also writing to clean up a provided string. For example, if my main.c program were to be provided with a string such as:
asdfFAweWFwseFL Wefawf JAWEFfja FAWSEF
it would call upon a function in externalLibrary.c (lets call it externalLibrary_Clean for now) that would take in the string, and return all characters in upper case without spaces:
ASDFFAWEWFWSEFLWEFAWFJAWEFFJAFAWSEF
The crazy part is that I have this working... so long as my string doesn't exceed 26 characters in length. As soon as I add a 27th character, I end up with an error that says
malloc(): corrupted top size.
Here is externalLibrary.c:
#include "externalLibrary.h"
#include <ctype.h>
#include <malloc.h>
#include <assert.h>
#include <string.h>
char * restrict externalLibrary_Clean(const char* restrict input) {
// first we define the return value as a pointer and initialize
// an integer to count the length of the string
char * returnVal = malloc(sizeof(input));
char * initialReturnVal = returnVal; //point to the start location
// until we hit the end of the string, we use this while loop to
// iterate through it
while (*input != '\0') {
if (isalpha(*input)) { // if we encounter an alphabet character (a-z/A-Z)
// then we convert it to an uppercase value and point our return value at it
*returnVal = toupper(*input);
returnVal++; //we use this to move our return value to the next location in memory
}
input++; // we move to the next memory location on the provided character pointer
}
*returnVal = '\0'; //once we have exhausted the input character pointer, we terminate our return value
return initialReturnVal;
}
int * restrict externalLibrary_getFrequencies(char * ar, int length){
static int freq[26];
for (int i = 0; i < length; i++){
freq[(ar[i]-65)]++;
}
return freq;
}
the header file for it (externalLibrary.h):
#ifndef LEARNINGC_EXTERNALLIBRARY_H
#define LEARNINGC_EXTERNALLIBRARY_H
#ifdef __cplusplus
extern "C" {
#endif
char * restrict externalLibrary_Clean(const char* restrict input);
int * restrict externalLibrary_getFrequencies(char * ar, int length);
#ifdef __cplusplus
}
#endif
#endif //LEARNINGC_EXTERNALLIBRARY_H
my main.c file from where all the action is happening:
#include <stdio.h>
#include "externalLibrary.h"
int main() {
char * unfilteredString = "ASDFOIWEGOASDGLKASJGISUAAAA";//if this exceeds 26 characters, the program breaks
char * cleanString = externalLibrary_Clean(unfilteredString);
//int * charDist = externalLibrary_getFrequencies(cleanString, 25); //this works just fine... for now
printf("\nOutput: %s\n", unfilteredString);
printf("\nCleaned Output: %s\n", cleanString);
/*for(int i = 0; i < 26; i++){
if(charDist[i] == 0){
}
else {
printf("%c: %d \n", (i + 65), charDist[i]);
}
}*/
return 0;
}
I'm extremely well versed in Java programming and I'm trying to translate my knowledge over to C as I wish to learn how my computer works in more detail (and have finer control over things such as memory).
If I were solving this problem in Java, it would be as simple as creating two class files: one called main.java and one called externalLibrary.java, where I would have static String Clean(string input) and then call upon it in main.java with String cleanString = externalLibrary.Clean(unfilteredString).
Clearly this isn't how C works, but I want to learn how (and why my code is crashing with corrupted top size)

The bug is this line:
char * returnVal = malloc(sizeof(input));
The reason it is a bug is that it requests an allocation large enough space to store a pointer, meaning 8 bytes in a 64-bit program. What you want to do is to allocate enough space to store the modified string, which you can do with the following line:
char *returnVal = malloc(strlen(input) + 1);
So the other part of your question is why the program doesn't crash when your string is less than 26 characters. The reason is that malloc is allowed to give the caller slightly more than the caller requested.
In your case, the message "malloc(): corrupted top size" suggests that you are using libc malloc, which is the default on Linux. That variant of malloc, in a 64-bit process, would always give you at least 0x18 (24) bytes (minimum chunk size 0x20 - 8 bytes for the size/status). In the specific case that the allocation immediately precedes the "top" allocation, writing past the end of the allocation will clobber the "top" size.
If your string is larger than 23 (0x17) you will start to clobber the size/status of the subsequent allocation because you also need 1 byte to store the trailing NULL. However, any string 23 characters or shorter will not cause a problem.
As to why you didn't get an error with a string with 26 characters, to answer that one would have to see that exact program with the string of 26 characters that does not crash to give a more precise answer. For example, if the program provided a 26-character input that contained 3 blanks, this would would require only 26 + 1 - 3 = 24 bytes in the allocation, which would fit.
If you are not interested in that level of detail, fixing the malloc call to request the proper amount will fix your crash.

Logical Error in C++ Hexadecimal Converter Code

I've been working on this Hexadecimal Converter and there seems to be a logical error somewhere in the program. I've run it on Ubuntu using the g++ tool and every time I run t program, it gives me a massive heap of garbage values. I can't figure out the source of the garbage values and neither can I find the source of the logical error. I'm a newbie at programming, so please help me figure out my mistake.
#include <iostream>
#include <math.h>
using namespace std;
int main()
{
int bin[20],finhex[10],num,bc=0,i,j,k,l=0,r=10,n=1,binset=0,m=0;
int hex[16]= {0000,0001,0010,0011,0100,0101,0110,0111,1000,1001,1010,1011,1100,1101,1110,1111};
char hexalph='A';
cout<<"\nEnter your Number: ";
cin>>num;
while(num>0)
{
bin[bc]=num%2;
num=num/2;
bc++;
}
if(bc%4!=0)
bc++;
for(j=0;j<bc/4;j++)
for(i=0;i<4;i++)
{
binset=binset+(bin[m]*pow(10,i));
m++;
}
for(k=0;k<16;k++)
{
if(hex[k]==binset)
{
if(k<=9)
finhex[l]=k;
else
while(n>0)
{
if(k==r)
{
finhex[l]=hexalph;
break;
}
else
{
hexalph++;
r++;
}
}
l++;
r=10;
binset=0;
hexalph='A';
break;
}
}
while(l>=0)
{
cout<<"\n"<<finhex[l];
l--;
}
return 0;
}

int hex[16]= {0000,0001,0010,0011,0100,0101,0110,0111,1000,1001,1010,1011,1100,1101,1110,1111};
Allow me to translate those values into decimal for you:
int hex[16] = {0, 1, 8, 9, 64, 65, 72, 73, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111};
If you want them to be considered binary literals then you need to either specify them as such or put them in some other form that the compiler understands:
int hex[16] = {0b0000, 0b0001, 0b0010, 0b0011, 0b0100, 0b0101, 0b0110, 0b0111, 0b1000, 0b1001, 0b1010, 0b1011, 0b1100, 0b1101, 0b1110, 0b1111};
int hex[16] = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf};

While Ignacio Vazquez-Abrams rightly hinted at the fact that some of your initializers of hex are (due to the prefix 0) octal constants, he overlooked that you chose the unusual, but possible way of representing binary literals as decimal constants with only the digits 0 and 1. Thus, you only have to remove the prefix 0 from all constants greater than 7:
int hex[16] = {0000,0001,10,11,100,101,110,111,1000,1001,1010,1011,1100,1101,1110,1111};
Then, you stored the characters 'A' etc. in int finhex[] and did output them with cout<<"\n"<<finhex[l] - but this way not A is printed, but rather its character code value, e. g. in ASCII 65. In order to really output the character A etc., we could change the finhex array element type to char:
char bin[20],finhex[10]; int num,bc=0,i,j,k,l=0,r=10,n=1,binset=0,m=0;
- but consequently we also have to store the digits 0 to 9 as their character code values:
if (k<=9)
finhex[l]='0'+k;
Furthermore, with the lines
if(bc%4!=0)
bc++;
you rightly pondered on the need to have a multiple of 4 bits for the conversion, but you overlooked that more than one bit could be missing, and also that the additional elements of bin[] are uninitialized, so change to:
while (bc%4!=0) bin[bc++] = 0;
Besides, you omitted block braces around the (appropriately indented) two inner for loops; since C++ is not Python, the indentation has no significance and without surrounding braces only the first of the indented for loops is nested into the outer for loop.
The final while loop should be outdented and go outside the big for loop. There's also an indexing error in it, as the finhex array is indexed with an l which is by one to high; you could change this to:
while (l--) cout<<finhex[l];
cout<<"\n";

Converting wchar_t* to char* on iOS

I'm attempting to convert a wchar_t* to a char*. Here's my code:
size_t result = wcstombs(returned, str, length + 1);
if (result == (size_t)-1) {
int error = errno;
}
It indeed fails, and error is filled with 92 (ENOPROTOOPT) - Protocol not available.
I've even tried setting the locale:
setlocale(LC_ALL, "C");
And this one too:
setlocale(LC_ALL, "");
I'm tempted to just throw the characters with static casts!

Seems the issue was that the source string was encoded with a non-standard encoding (two ASCII characters for each wide character), which looked fine in the debugger, but clearly internally was sour. The error code produced is clearly not documented, but it's the equivalent to simply not being able to decode said piece of text.

Convert hex to int

I've seen lots of answers to this, but I cannot seem to get any to work. I think I'm getting confused between variable types. I have an input from NetworkStream that is put a hex code into a String^. I need to take part of this string, convert it to a number (presumably int) so I can add some arithemetic, then output the reult on the form. The code I have so far:
String^ msg; // gets filled later, e.g. with "A55A6B0550000000FFFBDE0030C8"
String^ test;
//I have selected the relevant part of the string, e.g. 5A
test = msg->Substring(2, 2);
//I have tried many different routes to extract the numverical value of the
//substring. Below are some of them:
std::stringstream ss;
hexInt = 0;
//Works if test is string, not String^ but then I can't output it later.
ss << sscanf(test.c_str(), "%x", &hexInt);
//--------
sprintf(&hexInt, "%d", test);
//--------
//And a few others that I've deleted after they don't work at all.
//Output:
this->textBox1->AppendText("Display numerical value after a bit of math");
Any help with this would be greatly appreciated.
Chris

Does this help?
String^ hex = L"5A";
int converted = System::Convert::ToInt32(hex, 16);
The documentation for the Convert static method used is on the MSDN.
You need to stop thinking about using the standard C++ library with managed types. The .Net BCL is really very good...

Hope this helps:
/*
the method demonstrates converting hexadecimal values,
which are broken into low and high bytes.
*/
int main(){
//character buffer
char buf[1];
buf[0]= 0x06; //buffer initialized to some hex value
buf[1]= 0xAE; //buffer initialized to some hex value
int number=0;
//number generated by binary shift of high byte and its OR with low byte
number = 0xFFFF&((buf[1]<<8)|buf[0]);
printf("%x",number); //this prints AE06
printf(“%d”,number); //this prints the integer equivalent
getch();
}

Parsing a string with varying number of whitespace characters in C

I'm pretty new to C, and trying to write a function that will parse a string such as:
"This (5 spaces here) is (1 space
here) a (2 spaces here) string."
The function header would have a pointer to the string passed in such as:
bool Class::Parse( unsigned char* string )
In the end I'd like to parse each word regardless of the number of spaces between words, and store the words in a dynamic array.
Forgive the silly questions...
But what would be the most efficient way to do this if I am iterating over each character? Is that how strings are stored? So if I was to start iterating with:
while ( (*string) != '\0' ) {
--print *string here--
}
Would that be printing out
T
h
i... etc?
Thank you very much for any help you can provide.

from http://www.cplusplus.com/reference/clibrary/cstring/strtok/
/* strtok example */
#include <stdio.h>
#include <string.h>
int main ()
{
char str[] ="- This, a sample string.";
char * pch;
printf ("Splitting string \"%s\" into tokens:\n",str);
pch = strtok (str," ,.-"); /* split the string on these delimiters into "tokens" */
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok (NULL, " ,.-"); /* split the string on these delimiters into "tokens" */
}
return 0;
}
Splitting string "- This, a sample string." into tokens:
This
a
sample
string

First of all, C does not have classes, so in a C program you would probably define your function with a prototype more like one of the following:
char ** my_prog_parse(char * string) {
/* (returns a malloc'd array of pointers into the original string, which has had
* \0 added throughout ) */
char ** my_prog_parse(const char * string) {
/* (returns a malloc'd NULL-terminated array of pointers to malloc'd strings) */
void my_prog_parse(const char * string, char buf, size_t bufsiz,
char ** strings, size_t nstrings)
/* builds a NULL-terminated array of pointers into buf, all memory
provided by caller) */
However, it is perfectly possible to use C-style strings in C++...
You could write your loop as
while (*string) { ... ; string++; }
and it will compile to exactly the same assembler on a modern optimizing compiler. yes, that is a correct way to iterate through a C-style string.
Take a look at the functions strtok, strchr, strstr, and strspn... one of them may help you build a solution.

I wouldn't do any non-trivial parsing in C, it's too laborious, the language is not suitable for that. But if you mean C++, and it looks like you do, since you wrote Class::Parse, then writing recursive descent parsers is pretty easy, and you don't need to reinvent the wheel. You can take Spirit for example, or AXE, if you compiler supports C++0x. For example, your parser in AXE can be written in few lines:
// assuming you have 0-terminated string
bool Class::Parse(const char* str)
{
auto space = r_lit(' ');
auto string_rule = "This" & r_many(space, 5) & space & 'a' & r_many(space, 2)
& "string" & r_end();
return string_rule(str, str + strlen(str)).matched;
}

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string